Testing and modeling causal relationships introduction | Vose Software

See also: Introduction to risk analysis, Types of models to analyze data

In this section we look at some practical issues of causality from a risk analysis perspective. There are a few simple, very practical and intuitive rules that will help you test a hypothesized causal relationship.

Causal inference is mostly applied to health issues, although the thinking has potential applications in other areas like econometrics, so we will use health issues as examples in this topic. We can attempt to use a causal model to answer three different types of questions:

  • Predictions - what will happen given a certain set of conditions?

  • Interventions - what would be the effect of controlling one or more conditions?

  • Counterfactuals - what would have happened differently if one or more conditions had been different?

In a deterministic (non-random) world there is a straightforward interpretation of causality. CSI Miami and derivatives, and all those medical dramas, are such fun programs because we viewers try to figure out what really happened - what caused this week's murder(s) - and of course the program always finishes with a satisfyingly unequivocal solution.

In the risk analysis world we have to work with causal relationships that are usually probabilistic in nature, for example:

  • The probability of having lung cancer within your life is x if you smoke.

  • The probability of having lung cancer within your life is y if you don't smoke.

We all know that x > y, which makes being a smoker a risk factor. But life is more complicated than that: there is a biological gradient, meaning in this case that the more you smoke, the more likely you are to develop cancer. If we were to do a study designed to determine the causal relationship between smoking and cancer we should look not just at whether people smoked at all, but at how much a person has smoked, for how long, and in what way (cigars, cigarettes with or without filters, pipes, little puffs or deep inhaling, brand, etc). Things are further complicated because people can change their smoking habits over time. How about:

  • The probability of having lung cancer within your life is a if you carry matches.

  • The probability of having lung cancer within your life is b if you don't carry matches.

If this were researched, the outcome would likely be a > b, although carrying matches should not be a risk factor. A correct statistical analysis will detect the high correlation between carrying matches (or lighters) and using tobacco products. A sensible statistician would figure out that matches should be removed from the analysis. An uncontrolled statistical analysis can produce some silly results (imagine we had no idea that tobacco could be related to cancer and didn't collect any tobacco-related data), so we should always apply some disciplined thinking to how we structure and interpret a statistical model. We need a few definitions to begin:

A risk factor is an aspect of personal behaviour or lifestyle, environment or characteristic thought to be associated positively or negatively with a particular adverse condition.

A counterfactual world is an epidemiological hypothetical idea of a world similar to our own in all ways except that the exposure to the hazard, or people's behaviour or characteristics, or something else that affects exposure, has been changed in some way.

The population attributable risk (PAR) (aka population aetiological fraction, among many others) is the proportion of the incidence in the population attributable to exposure to a risk factor. It represents the fraction by which the incidence in the population would have reduced in a counterfactual world where the effect associated with that risk factor was not present.

These concepts are often used to help model what the future might look like if we were to eliminate a risk factor but we need to be careful since they technically only refer to the comparison of an observed world and a counterfactual parallel world in which the risk factor did not appear - making predictions of the future means that we have to assume that the future world would look just like that counterfactual one.
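The PAR definition can be made concrete with a small calculation. The sketch below is only an illustration: it assumes a single binary risk factor, and the prevalence and the probabilities x and y are hypothetical numbers, not data. It computes the PAR directly from the observed and counterfactual incidence, and again from the relative risk (the form usually called Levin's formula), to show that the two agree.

```python
def population_attributable_risk(prevalence, risk_exposed, risk_unexposed):
    """Fraction of the population incidence that would disappear in a
    counterfactual world where everyone has the unexposed risk."""
    observed = prevalence * risk_exposed + (1.0 - prevalence) * risk_unexposed
    counterfactual = risk_unexposed
    return (observed - counterfactual) / observed


def par_from_relative_risk(prevalence, relative_risk):
    """The same quantity written in terms of the relative risk."""
    excess = prevalence * (relative_risk - 1.0)
    return excess / (1.0 + excess)


# Hypothetical inputs, chosen only for illustration.
p_smoke = 0.25  # fraction of the population that smokes
x = 0.10        # lifetime probability of lung cancer if you smoke
y = 0.01        # lifetime probability of lung cancer if you don't

par = population_attributable_risk(p_smoke, x, y)
par_rr = par_from_relative_risk(p_smoke, x / y)
print(f"PAR from incidences:    {par:.3f}")
print(f"PAR from relative risk: {par_rr:.3f}")
```

With these made-up inputs roughly 69% of the incidence would be attributed to smoking. The figure depends as much on how common the exposure is as on how dangerous it is.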

In figuring out the PAR we may well have to consider interactions between risk factors. Consider the situation where the presence of either of two risk factors gives an extremely high probability of the risk of interest, and where a significant fraction of the population is exposed to both risk factors: in that case there is a lot of overlap, and an individual risk factor has less impact because the other risk factor is competing for the same victims. On the other hand, exposure to two chemicals at the same time might produce a far greater effect than either chemical alone. We talk about synergism and antagonism when the risk factors work together or against each other respectively. Synergism is more common, so the PAR for the combination of two or more risk factors is usually less than the sum of their individual PARs.
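The effect of interactions on the PAR can be explored numerically. The following is a minimal sketch under stated assumptions: two binary risk factors A and B that are independently distributed in the population, and two hypothetical risk tables (one where either factor alone already gives a very high risk, one where the factors are strongly synergistic). All names and numbers are illustrative.

```python
def incidence(p_a, p_b, risk):
    """Population incidence when A and B are independently distributed.
    risk[(a, b)] is the probability of disease given exposure flags a and b."""
    total = 0.0
    for a in (0, 1):
        for b in (0, 1):
            share = (p_a if a else 1.0 - p_a) * (p_b if b else 1.0 - p_b)
            total += share * risk[(a, b)]
    return total


def pars(p_a, p_b, risk):
    """Return (PAR for A alone, PAR for B alone, PAR for removing both)."""
    observed = incidence(p_a, p_b, risk)
    counterfactuals = (
        incidence(0.0, p_b, risk),  # world without A
        incidence(p_a, 0.0, risk),  # world without B
        incidence(0.0, 0.0, risk),  # world without either
    )
    return tuple((observed - cf) / observed for cf in counterfactuals)


# Hypothetical risk tables keyed by (exposed to A, exposed to B).
overlap = {(0, 0): 0.01, (1, 0): 0.90, (0, 1): 0.90, (1, 1): 0.90}
synergy = {(0, 0): 0.01, (1, 0): 0.02, (0, 1): 0.02, (1, 1): 0.50}

for name, table in (("overlap", overlap), ("synergy", synergy)):
    par_a, par_b, par_both = pars(0.5, 0.5, table)
    print(f"{name}: PAR(A)={par_a:.3f}  PAR(B)={par_b:.3f}  "
          f"sum={par_a + par_b:.3f}  PAR(A and B)={par_both:.3f}")
```

In the overlap table each individual PAR is small, because removing one factor leaves most victims to the other. In the synergy table the individual PARs add up to more than the PAR for removing both factors together.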

A food safety example: Campylobacter

A large survey conducted by CDC (the highly reputable Centers for Disease Control and Prevention) in the United States looked at why people end up getting a certain type of food poisoning (campylobacteriosis). You get campylobacteriosis when bacteria called Campylobacter enter your intestine, find a suitably protected location and multiply (form a colony). Thus, the sequence of events resulting in campylobacteriosis must include some exposure to the bacteria, then survival of those bacteria through the stomach (the acid can kill them), then setting up a colony. For us to observe the infection, the person has to become ill. To identify the disease as campylobacteriosis, a doctor has to ask for a stool sample, it has to be provided, and the stool sample has to be cultured and the Campylobacter isolated and identified. Campylobacteriosis will usually resolve itself after a week or so of unpleasantness, so many more people have campylobacteriosis than a healthcare provider will ever observe.

The US survey looked at behavior patterns of people with confirmed cases and tried to match them with others of the same sex, age, etc. known not to have suffered from a foodborne illness and looked for patterns of differences. This is called a case-control study. Some of the key factors were (+ meaning positively associated with illness, - meaning negatively associated):

1. Ate barbecued chicken (+);

2. Ate in a restaurant (+);

3. Were male and young (+);

4. Had healthcare insurance (+);

5. Are in a low socio-economic band (+);

6. There was another member of the family with an illness (+);

7. The person was old (+);

8. Regularly ate chicken at home (-);

9. Had a dog or cat (+); and

10. Worked on a farm (+)

Let's see whether this matches our understanding of the world:

#1 makes sense since Campylobacter naturally occurs in chickens and is very frequently found in chicken meat. People are also somewhat less careful with their hygiene and the cooking is less controlled at a barbecue (healthcare tip: when you've cooked a piece of meat place it on a different plate than the one used to bring the raw meat to the barbecue);

#2 makes sense because of cross-contamination in the restaurant kitchen so you might eat a veggie burger, but still have consumed Campylobacter originating from a chicken;

#3 makes sense because we guys tend not to pay much attention to kitchen practices when we're young and start off rather hopeless when we first leave home;

#4 makes sense in that, in the US, visiting a doctor is expensive and that is the only way the healthcare system will observe the infection;

#5 seems right because of number 4, and maybe because poorer people will eat cheaper quality food, will visit restaurants with higher capacity and lower standards (related to number 2);

#6 is obvious since fecal-oral transmission is a known route (healthcare tip: wash your hands very well particularly when you are ill);

#7 makes sense because older people have a less robust immune system, but maybe they also eat in restaurants more (less?) often, maybe they like chicken more, etc;

#8 seems strange. It appears from a number of studies that if you eat chicken at home you are less likely to get ill. Maybe that is because it displaces eating chicken at a restaurant, maybe it's because people who cook are wealthier or care more about their food, or maybe (the current theory) it is because these people get regular small exposures to Campylobacter that boost their immune system;

#9 is trickier. Perhaps pet food contains Campylobacter, perhaps the animal gets uncooked scraps, then cross-infects the family;

#10 makes sense. People working in chicken farms are obviously more at risk, but a farm will often have just a few chickens, or will buy in manure as fertiliser or used chicken bedding as cattle feed. Other animals also carry Campylobacter.

Each of the above is a demonstrable risk factor because each passed a test of statistical significance in this study (and others) and one can find a possible rational explanation. Of course, the possible rational explanation is often to be expected because the survey was put together with questions that were designed to test suspected risk factors, not the ones that weren't thought of. Note that the causal arguments are often inter-linked in some way making it difficult to figure out the importance of each factor in isolation. Statistical software can deal with this given the appropriate control.
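In a case-control study the strength of an association is commonly summarised as an odds ratio with a confidence interval. The sketch below is an illustration only: the counts are hypothetical and are not taken from the CDC survey, the interval uses the standard log-based (Woolf) approximation, and it ignores the matching of cases to controls, which a real analysis would handle with conditional methods.

```python
import math


def odds_ratio(cases_exposed, cases_unexposed,
               controls_exposed, controls_unexposed, z=1.96):
    """Odds ratio and approximate 95% confidence interval for a 2x2 table."""
    estimate = (cases_exposed * controls_unexposed) / (
        cases_unexposed * controls_exposed)
    # Standard error of log(odds ratio) under the Woolf approximation.
    se_log = math.sqrt(1 / cases_exposed + 1 / cases_unexposed
                       + 1 / controls_exposed + 1 / controls_unexposed)
    lower = math.exp(math.log(estimate) - z * se_log)
    upper = math.exp(math.log(estimate) + z * se_log)
    return estimate, lower, upper


# Hypothetical counts: exposure = "ate barbecued chicken".
estimate, low, high = odds_ratio(
    cases_exposed=60, cases_unexposed=140,
    controls_exposed=30, controls_unexposed=170)
print(f"odds ratio = {estimate:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

An interval that lies entirely above 1 corresponds to a (+) factor in the list above, and one entirely below 1 to a (-) factor. An interval that straddles 1 would not pass the usual test of statistical significance.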

Evaluating evidence - three tests

The first test of causality you should make is to consider whether there is a known or possible causal mechanism that can connect two variables together. For this, you may need to think out of the box: the history of science is full of examples where people considered something impossible despite an enormous amount of evidence to the contrary, because they were so firmly attached to their pet theory.

The second test is temporal ordering: if a change in variable A has an effect on variable B then the change in A should occur before the resultant change in B. If a person dies of radiation poisoning (B) then that person must have received a large dose of radiation (A) at some previous time. We can often test for temporal ordering with statistics, usually some form of regression.

But be careful: temporal ordering on its own doesn't imply a causal relationship. Imagine you have a variable X that affects variables A and B, but B responds faster than A. If X is unobserved, all we see is that A exhibits some behaviour that strongly correlates in some way with the previous behaviour of B.
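A short simulation shows how this can happen. It is a sketch under simple assumptions: X is random noise, B echoes X one time step later and A echoes X three steps later, each with some added noise, and neither A nor B influences the other. The lags, noise level and sample size are arbitrary choices for illustration.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


random.seed(1)
n = 5000
x = [random.gauss(0, 1) for _ in range(n)]  # the unobserved driver X

# B responds to X after 1 step, A after 3 steps; A and B never interact.
b = [x[t - 1] + random.gauss(0, 0.5) for t in range(3, n)]
a = [x[t - 3] + random.gauss(0, 0.5) for t in range(3, n)]

same_time = pearson(a, b)       # A and B at the same time step
lagged = pearson(b[:-2], a[2:])  # B now versus A two steps later
print(f"correlation at the same time: {same_time:.2f}")
print(f"correlation of B with A two steps later: {lagged:.2f}")
```

The lagged correlation is strong even though B has no effect on A at all, which is why temporal ordering needs to be read together with the first test, a plausible causal mechanism.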

The third test is to determine in some way the size of the possible causal effect. That's where statistics comes in. From a risk analysis perspective, we are usually interested in what we can change about the world. That ultimately implies that we are only really interested in determining the magnitude of the causal relationships between variables we can control and those we are interested in. Risk analysts are not scientists - our job is not to devise new theories, but to adapt the current scientific (or financial, engineering, etc) knowledge to help decision-makers make probabilistic decisions. However, it is important to be able to step back and ask whether a tightly-held belief is correct, and then to pose the awkward questions.

It's quite possible that we can come up with an alternative explanation of the world supported by the available evidence, which is fine, but that explanation has to be presented back to the scientific community for their blessing before we can rely on it to give decision-making advice.

Read on: Types of models to analyze data