Types of models to analyze data

Data can be analyzed in several ways to estimate the magnitude of hypothesised causal relationships between variables (possible risk factors). Note that these models can never prove a causal relationship, just as it is not possible to prove a theory, only to disprove it.

Neural nets

Neural nets look for patterns within data sets between several variables associated with a set of individuals. They can find correlations within data sets, and predict where a new observation might lie based on the values of the conditioning variables, but they do not have a causal interpretation and tend to be rather black-box in nature. Neural nets are used a lot in profiling. For example, they are used to estimate the level of credit risk associated with a credit card or mortgage applicant, or to identify a possible terrorist or smuggler at an airport. They don't seek to determine why a person might be a poor credit risk, for example; they just match the typical behaviour or history of someone who fails to pay their bills: things like having defaulted before, changing jobs frequently, or not owning a home.

Classification trees

Classification trees can be used to break down case-control data and list, from the top down, the most important factors influencing the outcome of interest. This is done by looking at the difference in the fraction of cases and controls that have the outcome of interest (e.g. disease) when they are split by each possible explanatory variable. So, for example, in a case-control study of lung cancer one might find that the fraction of people with lung cancer is much larger among smokers than among non-smokers, which forms the first fork in the tree. Looking then at the non-smokers only, one might find that the fraction of people with lung cancer is much higher for those who worked in a smoky environment than for those who did not. One continues to break down the population splits, working out which variable is the next most correlated with a difference in the risk, until one runs out of variables or statistical significance.
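The splitting step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the counts are invented, the names `case_fraction` and `best_split` are hypothetical, and the sketch ranks variables by the raw difference in case fraction without the statistical significance test a real package would apply before accepting a fork.

```python
# Invented case-control counts, for illustration only.
counts = [
    # (smoker, smoky_workplace, lung_cancer, number of people)
    (1, 1, 1, 2),
    (1, 0, 1, 5),
    (1, 0, 0, 1),
    (0, 1, 1, 2),
    (0, 1, 0, 1),
    (0, 0, 0, 5),
]
records = [
    {"smoker": s, "smoky_workplace": w, "lung_cancer": c}
    for s, w, c, n in counts
    for _ in range(n)
]


def case_fraction(rows, outcome):
    """Fraction of rows that have the outcome (0.0 for an empty group)."""
    return sum(r[outcome] for r in rows) / len(rows) if rows else 0.0


def best_split(rows, variables, outcome):
    """Return (variable, gap) for the variable with the largest gap in
    case fraction between its exposed and unexposed groups."""
    best = None
    for var in variables:
        exposed = [r for r in rows if r[var] == 1]
        unexposed = [r for r in rows if r[var] == 0]
        if not exposed or not unexposed:
            continue  # this variable cannot split the current group
        gap = abs(case_fraction(exposed, outcome)
                  - case_fraction(unexposed, outcome))
        if best is None or gap > best[1]:
            best = (var, gap)
    return best


variables = ["smoker", "smoky_workplace"]

# First fork: which variable separates cases from controls best overall?
first_fork = best_split(records, variables, "lung_cancer")

# Second fork: repeat the search within the non-smokers only.
non_smokers = [r for r in records if r["smoker"] == 0]
second_fork = best_split(non_smokers, variables, "lung_cancer")

print(first_fork, second_fork)
```

With these invented counts, smoking forms the first fork and the smoky workplace forms the next fork among the non-smokers, mirroring the lung cancer example.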

Regression models

Logistic regression is used a lot to determine whether there is a possible relationship between variables in a data set and the variable to be predicted. The probability p_i of a 'success' (e.g. exhibiting the disease) for the dichotomous (two possible outcomes) variable we wish to predict is related to the various possible influencing variables by a regression equation, for example:

$$\ln\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij}$$

where the subscript i refers to each observation and the subscript j refers to each possible explanatory variable in the data set, of which there are k in total. Here x_{ij} is the value of explanatory variable j for observation i, \beta_0 is the intercept and \beta_j is the fitted coefficient for variable j.

Step-wise regression is used in two forms. Forward selection starts off with no predictive variables and sequentially adds them until there is no statistically significant improvement in matching the data. Backward selection starts with all variables in the pot and keeps taking away the least significant variable until the model's statistical predictive capability begins to suffer. Logistic regression can take account of important correlations between possible risk factors by including interaction terms. Like neural nets, it has no in-built causal thinking.
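As a rough illustration of how such a model turns explanatory variables into a probability, the sketch below assumes the usual logit form, ln(p/(1-p)) = b0 + sum of b_j * x_j. The coefficient values and the function names `predicted_probability` and `log_likelihood` are hypothetical, chosen only for the example; in practice the coefficients are fitted to the data by the software.

```python
import math


def predicted_probability(intercept, coefficients, x):
    """Invert the logit: p = 1 / (1 + exp(-(b0 + sum_j b_j * x_j)))."""
    linear = intercept + sum(b * v for b, v in zip(coefficients, x))
    return 1.0 / (1.0 + math.exp(-linear))


def log_likelihood(intercept, coefficients, X, y):
    """Log-likelihood of the observed 0/1 outcomes under the model.
    Step-wise selection compares quantities like this between models."""
    total = 0.0
    for x, outcome in zip(X, y):
        p = predicted_probability(intercept, coefficients, x)
        total += math.log(p) if outcome == 1 else math.log(1.0 - p)
    return total


# Hypothetical coefficients for two 0/1 risk factors (assumed values).
intercept = -2.0
coefficients = [1.5, 0.8]  # e.g. smoker, smoky workplace

p_unexposed = predicted_probability(intercept, coefficients, [0, 0])
p_exposed = predicted_probability(intercept, coefficients, [1, 1])

# Invented observations: rows of explanatory variables and 0/1 outcomes.
X = [[0, 0], [1, 0], [0, 1], [1, 1]]
y = [0, 1, 0, 1]
fit = log_likelihood(intercept, coefficients, X, y)

print(p_unexposed, p_exposed, fit)
```

A forward-selection step would refit the model with each candidate variable added in turn and keep the one that gives the largest statistically significant gain in log-likelihood.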

Bayesian belief networks

Bayesian belief networks (aka directed acyclic graphs, or DAGs) are, visually, networks of nodes (observed variables) connected together by arcs (probabilistic relationships). They offer the closest connection to causal inference thinking. In principle you could let DAG software run on a set of data and come up with a set of conditional probabilities. It sounds appealing and objectively hands-off, but the networks need the benefit of human experience to know the direction in which these arcs should go, i.e. what the directions of influence really are (and whether they exist at all). I'm a firm believer in assigning some constraints to what the model should test, but make sure you know why you are applying those constraints. To quote Judea Pearl (Pearl, 2001): 'compliance with human intuition has been the ultimate criterion of adequacy in every philosophical study of causation, and the proper incorporation of background information into statistical studies likewise relies on accurate interpretation of causal judgment'.
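To make the idea of nodes, arcs and conditional probabilities concrete, here is a minimal sketch of a three-node network in which two parent nodes (smoker, smoky workplace) each point to a child node (lung cancer). The arc directions and every probability are assumptions invented for the example, which is exactly the kind of human input discussed above; the name `joint` is hypothetical.

```python
from itertools import product

# Assumed prior probabilities for the two parent nodes.
p_smoker = 0.3
p_workplace = 0.2

# Assumed conditional table: P(cancer = 1 | smoker, smoky workplace).
p_cancer_given = {
    (1, 1): 0.20,
    (1, 0): 0.10,
    (0, 1): 0.05,
    (0, 0): 0.01,
}


def joint(s, w, c):
    """Joint probability of one configuration, factorised along the arcs:
    P(s, w, c) = P(s) * P(w) * P(c | s, w)."""
    ps = p_smoker if s else 1.0 - p_smoker
    pw = p_workplace if w else 1.0 - p_workplace
    pc = p_cancer_given[(s, w)] if c else 1.0 - p_cancer_given[(s, w)]
    return ps * pw * pc


# Marginal probability of cancer: sum the joint over both parents.
p_cancer = sum(joint(s, w, 1) for s, w in product((0, 1), repeat=2))

# Reasoning against the arc direction: P(smoker | cancer) by Bayes' rule.
p_smoker_given_cancer = sum(joint(1, w, 1) for w in (0, 1)) / p_cancer

print(p_cancer, p_smoker_given_cancer)
```

Reversing an arc would change the factorisation in `joint`, which is why the direction of influence has to be decided with care rather than left entirely to the software.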

Commercial software packages are available for each of these methods. The algorithms they use are often proprietary, and can give different results on the same data sets, which is rather frustrating and presents an opportunity to those who are looking for a particular answer (don't do that). In all of the above techniques, it is important to split your data into a training set and a validation set, to test whether the relationships that the software finds in the training set let you predict the outcome observations in the validation set with reasonable accuracy (i.e. at the decision-maker's required accuracy). Best practice involves repeated random splitting of your data into training and validation sets.
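Repeated random splitting is straightforward to set up yourself. The sketch below is one simple way to do it, not a prescribed method: the function name `repeated_splits`, the 30% validation fraction and the number of repeats are all assumptions for illustration.

```python
import random


def repeated_splits(rows, n_repeats=5, validation_fraction=0.3, seed=0):
    """Yield (training, validation) pairs from repeated random shuffles.
    A fixed seed keeps the splits reproducible."""
    rng = random.Random(seed)
    n_validation = max(1, int(round(len(rows) * validation_fraction)))
    for _ in range(n_repeats):
        shuffled = rows[:]        # copy, so the caller's data is untouched
        rng.shuffle(shuffled)
        yield shuffled[n_validation:], shuffled[:n_validation]


# Stand-in data: 20 observation identifiers.
rows = list(range(20))
splits = list(repeated_splits(rows, n_repeats=3))

for training, validation in splits:
    # Fit the model on `training`, then score its predictions on
    # `validation`; here we only report the sizes of the two sets.
    print(len(training), len(validation))
```

Comparing the validation accuracy across the repeats shows how sensitive the fitted relationships are to which observations happened to land in the training set.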
