Causal Inference
This is a blog for my project on the applications of causal inference to machine learning.
What is causal inference and why do I care?
What do we really mean when we say “the economy does better when my favorite political party is in power” or what do we really mean when we say “unvaccinated adults are 35 times more likely to get omicron infection”. A very natural property of interest in any dataset is the pattern of correlation  when two events are statistically related. Correlation is an important part of statistics, but what’s more important is how we tend to interpret it. Correlations tell us that two events tend to share some pattern and not that one event is the reason for the other occuring. We really want to say that “the economy is better because of my favorite political party, and not just they happen to occur together. Establishing causal relationships between phenomenon is the study of causal inference. You might have heard the phrase ‘correlation is not the same as causation’. This is true. While it’s true that both the number of tech companies and the number of teddy bears sold has gone up over the last few decades, one does not necessarily cause the other (perhaps both are caused by, among lots of other things, an increase in population sizes leading to more children buying teddy bears and a bigger scientific community built over the years). But this is an example of events that are correlated  they both values have gone up together  but probably not causally related.
I’m interested in causal inference, at the highest level, because it is, in some sense, the most natural question to ask when studying complex datasets. Questions about causation are often the most interesting and, so, inherently the most enlightening in any survey. We know when the economy rises and falls but it’s establishing its cause thats the most interesting. Mathematically, this presents much richer questions: how does one quantitatively establish causation between variables? how is causation structurally different from phenomenon to phenomenon? and how do we quantify that difference? These questions have a lot of implication on not only our understanding of economic and natural phenomenon but also about the human mind and our understanding of how we learn things and form such associations and establish legitimate and unfound causal relationships. I’m interested in these sorts of questions and am looking to explore these in greater depth because I’m fascinated by the question of causation, the mathematical bases on which it can be established, and the understanding of human behavior and learning patterns that this brings.
What question am I interested in?
I’m interested in the applications of causal inference to machine learning (ML). The main aim for any ML problem is to predict a label y from a feature vector x by training an algorithm on some known features. Different approaches require different training data: supervised learning refers to the method of training algorithms using feature vectors with known labels, while semisupervised learning (SSL) refers to the method of training algorithms with some data with known labels and some with unknown labels. The underlying assumption in most ML problems is that we are trying to find associations or correlations between the labels and features to make predictions; we normally don’t care about their causal structure. It doesn’t seem to matter if x causes y or vice versa since algorithms are expected to perform the same. However, this has shown to not necessarily be true. Scholköpf et al. showed in their 2012 ICML paper that there is in fact a difference in performance in these two cases. SSL algorithms performed better on anticausal problems (i.e. where x causes y ) than in causal problems (vice versa). Figure 1 below illustrates this.
Here we see that the performance of SSL on anticausal problems and causal is considerably different to supervised learning. I’m interested in looking at why this is the case and what kinds of problems do better with SSL vs supervised learning.
A toy example where we can see this is in the case of two variables
$\$\$X\; =\; N\_1\$\$$ $\$\$Y\; =\; N\_1\; +\; X\$\$$
Where $\$N\_1,\; N\_2\; \backslash overset\{\backslash mathrm\{iid\}\}\{\backslash sim\}\; Be(\backslash frac\{1\}\{2\})\$$. Here there is a clear causal dependency $\$X\; \backslash rightarrow\; Y\$$.
We can factor the joint probability distribution as
$\$\$P(X,Y)\; =\; P(X)\; P(YX)\; =\; P(Y)\; P(XY)\$\$$
This factorization is not symmetric since we know that with the current variables, we get the probabilities
$\$\$P(X\; =\; x)\; =\; 1/2\; \backslash text\{\; for\; \}\; x=0,1\$\$$$\$\$P(Y=y)\; =\; \backslash begin\{cases\}\; \backslash frac\{1\}\{4\}\; \&\; \backslash text\{if\; \}y=0,2\; \backslash newline\; \backslash frac\{1\}\{2\}\; \&\; \backslash text\{if\; \}y=1\backslash end\{cases\}\$\$$$\$\$P(X=x\; \; Y\; =\; 0)\; =\; \backslash begin\{cases\}\; 1\; \&\; \backslash text\{if\; \}x=0\; \backslash newline\; 0\; \&\; \backslash text\{if\; \}x=1\backslash end\{cases\}\$\$$ $\$\$P(Y=y\; \; X\; =\; 0)\; =\; \backslash begin\{cases\}\; \backslash frac\{1\}\{2\}\; \&\; \backslash text\{if\; \}y=0\; \backslash newline\; \backslash frac\{1\}\{2\}\; \&\; \backslash text\{if\; \}y=1\; \backslash newline\; 0\; \&\; \backslash text\{if\; \}y=2\backslash end\{cases\}\$\$$
In ML, we’re usually trying to understand the distribution $\$\; P(YX)\$$, i.e. given the training data, what is the probability of having any given label. But notice that when we change the distribution of X to, say, $\$Be(\backslash frac\{2\}\{3\})\$$, then we get the probabilities
$\$\$P(X=x)\; =\; \backslash begin\{cases\}\; \backslash frac\{1\}\{3\}\; \&\; \backslash text\{if\; \}x\; =\; 0\; \backslash newline\; \backslash frac\{2\}\{3\}\; \&\; \backslash text\{if\; \}x=1\backslash end\{cases\}\$\$$$\$\$P(Y=y)\; =\; \backslash begin\{cases\}\; \backslash frac\{1\}\{4\}\; \&\; \backslash text\{if\; \}y=0,2\; \backslash newline\; \backslash frac\{1\}\{2\}\; \&\; \backslash text\{if\; \}y=1\backslash end\{cases\}\$\$$$\$\$P(X=x\; \; Y\; =\; 0)\; =\; \backslash begin\{cases\}\; 1\; \&\; \backslash text\{if\; \}x=0\; \backslash newline\; 0\; \&\; \backslash text\{if\; \}x=1\backslash end\{cases\}\$\$$ $\$\$P(Y=y\; \; X\; =\; 0)\; =\; \backslash begin\{cases\}\; \backslash frac\{1\}\{3\}\; \&\; \backslash text\{if\; \}y=0\; \backslash newline\; \backslash frac\{2\}\{3\}\; \&\; \backslash text\{if\; \}y=1\; \backslash newline\; 0\; \&\; \backslash text\{if\; \}y=2\backslash end\{cases\}\$\$$
Notice that the distribution for$\$\; P(XY)\$$ does not change but the distribution for$\$\; P(YX)\$$ does. This is a twovariable example to show the assymetry in causal vs anticausal problems and why certain machine learning algorithms might learn$\$\; P(YX)\$$ differently in these cases. The goal for this project is to study this phenomenon. Specifically, I’m interested in the following questions:

Can we prove that SSL performs better in anticausal learning problems than other techniques, even for specific cases?

Is there an example of a dataset that is traditionally studied with ML techniques agnostic of causal structure that might be better suited to SSL techniques due to an underlying causal structure?
Identifiability
To understand this phenomenon better, first, let’s focus on the data and go back to our original motivation: discovering causal structure and finding the causal relationships between different variables. Given features$X$ and labels $Y$, can we distinguish between the causal directions$\$X\; \backslash rightarrow\; Y\$$ and $\$Y\; \backslash rightarrow\; X\$$? That is to say, can we ascribe a deterministic mechanism $\$X\; \backslash xrightarrow\{\backslash varphi\}\; Y\$\; as\; apposed\; to\; \$Y\; \backslash xrightarrow\{\backslash psi\}\; X\$$.
A basic assumption that we will make is that each variable is affected by (independent) noise variables $\$N\_Y\$\; (and\; \$N\_X\$\; for\; the\; reverse\; causal\; structure)$. So, the model we will work with is the graph in figure 2.
This is a basic structural causal model (SCM)  a model which includes a graph with one vertex per variable, jointly independent noise variables, and a deterministic function for each vertex that depends on the vertex’s parents and noise. This tells us we have a deterministic mechanism$\$\backslash varphi\$$ that gives us $\$Y\$\; as\; \$Y\; =\; \backslash varphi(X,\; N\_Y)\$$. The question of identifying graphs in causal learning then becomes a question of whether we can ascribe another mechanism from$\$Y\$\; to\; \$X\$$ as shown in figure 3.
It turns out that without any restrictions on the mechanism, the problem of identifying the graph is not feasible. Given variables$\$X,Y\$$ along with their probability distribution $\$P(X,Y)\$$, you can always fit a mechanism $\$\backslash varphi\$$ for either $X \rightarrow Y$ or $Y \rightarrow X$ (the noise variables being understood to be embedded in the graph). This is in the following proposition which states:
Proposition 1. Given two random variables $\$X\$\; and\; \$Y\$$, there is a (measurable) function$\$f\$$ and random variable$\$\; N\_Y\; \$$ such that $\$Y\; =\; f(X,\; N\_Y)\$$
Click to see proof
This isn’t the main proof of this section, so I’ll refer you to this paper by Zhang et al. for a nice proof. See Lemma 1.
Without any conditions, the indentification problem is not feasible. This should intuitively feel good  without any assumptions, we have no control over how sensitive the mechanism is to the noise and hence the complexity of $\$P(YX)\$$.
The question thus becomes: Under what conditions can we only ascribe one causal mechanism between the two variables and not the other (i.e. identify the causal graphs). One possibility is that we make add some structure of the model. So, in our work, we will make the following assumptions. The first will be an assumption about our model and the second will be general SCM assumptions:

The mechanism is an Additive Noise Model (ANM), i.e.$\$\$\backslash varphi\; (X,\; N\_Y)\; =\; \backslash phi(X)\; +\; N\_Y\$\$$ for some function
$\phi$, where $N_Y$ and X are independent. 
The usual assumptions of SCMs also hold: the independence of the mechanism and input ($\$P(Y\; \; X)\$\; and\; \$X\$$), Markov compatibility ($\$P(Y)\$\; just\; depends\; on\; it\u2019s\; parent\; \$X\$\; \; which\; is\; trivial\; in\; our\; case$), and causal sufficiency.
The first assumption is our main one. Hoyer et al. showed in their 2009 NeurIPS paper that under the ANM assumption, we can recover the causal mechanism in most cases. Before we state the main theorem, it is worth noting that since$\$X\$\; and\; \$N\_Y\$$ are independent, the joint distribution$\$\; P(X,Y)\; \$$ can be rewritten as
$\$\$\; P(X\; =\; x,\; Y\; =\; y)\; =\; P(X\; =\; x,\; N\_y\; =\; y\; \; \backslash phi(x))\$\$\; \$\$=\; P\_\{X\}(X\; =\; x)\; \backslash cdot\; P\_\{N\_Y\}(N\_Y\; =\; y\; \; \backslash phi(x))\; \$\$$
This equivalent restatement will make the theorem easier to state:
Theorem 1. Given$\$\; X,\; N\_Y\$\; ,\; and\; \$Y\$$ that satisfy an ANM with a function $\$\; \backslash phi\$$, if there is a backward mechanism of the same form, then $\phi, P_X, P_{N_Y}$ must satisfy the following differential equation:
$\$\$\; \backslash xi\text{'}\text{'}\text{'}\; =\; \backslash xi\text{'}\text{'}\; \backslash left(\; \backslash frac\{\backslash nu\text{'}\text{'}\text{'}\; \backslash phi\text{'}\}\{\backslash nu\text{'}\text{'}\}\; +\; \backslash frac\{\backslash phi\text{'}\text{'}\}\{\backslash phi\text{'}\}\; \backslash right)\; \; 2\; \backslash nu\; \text{'}\text{'}\; \backslash phi\text{'}\text{'}\; \backslash phi\text{'}\; +\; \backslash nu\text{'}\; \backslash phi\text{'}\text{'}\text{'}\; +\; \backslash frac\{\backslash nu\text{'}\; \backslash nu\text{'}\text{'}\text{'}\; \backslash phi\text{'}\text{'}\; \backslash phi\text{'}\}\{\backslash nu\text{'}\text{'}\}\; \; \backslash frac\{\backslash nu\text{'}\; (\backslash phi\text{'}\text{'})^2\}\{\backslash phi\text{'}\}\$\$$
where $\$\backslash nu\; :=\; \backslash operatorname\{log\}\; P\_\{N\_Y\}\$\; and\; \$\backslash xi\; :=\; \backslash operatorname\{log\}P\_X\$$, and we also have that $\$\backslash nu\text{'}\text{'}(y\backslash phi(x))\backslash phi\text{'}(x)\; \backslash neq\; 0\$$.
Also, we have that if these conditions hold, then if there is a$\$y\$\; for\; which\; \$\backslash nu\text{'}\text{'}(y\backslash phi(x))\backslash phi\text{'}(x)\; \backslash neq\; 0\$$ is true for every$\$\; x\; \$$ aside from a countable set, then the set of all$\$\; P\_X\; \$$ which admit a backward model is 3dimensional (i.e. can be contained in a 3dimensional affine space).
Of course, here we are assuming that all the relevant functions are thrice differentiable.
Click here if you’re interested in the proof
You can find my write up and notes on the proof here. The general idea of the proof is that just as we can factor the joint distribution$\$P\_\{X,Y\}\$$ into a product of distributions $\$P\_X\; \backslash cdot\; P\_\{N\_Y\}\$$, if a backword model exists, we could similarly factor it into $\$P\_Y\; \backslash cdot\; P\_\{N\_X\}\$$. Taking second order partial derivatives of (the logs of) both of these factorizations give different expressions which we set to be equal since they’re the derivatives of the same distribution. Setting the derivatives equal and rearranging terms gives us part one. Part two follows from the theory of linear differential equations.
The differential equation seems quite arbitrary, but let’s understand at a high level what this tells us. The differential equation is of order 3 (since the highest derivative taken is a third derivative). We can also see that it is linear in $\$\backslash xi\$$; that is, it can be written as
$\$\$a\_0\; +\; a\_2\; \backslash xi\text{'}\text{'}\; +\; a\_3\; \backslash xi\; \text{'}\text{'}\text{'}\; =\; b(x,y)\$\$$
and the solution space must be 3dimensional (see proof for details). Given that the dimension of the space of every possible solution$\$\backslash xi\$$ is infinite dimensional, this tells us that almost every ANM$\$\backslash varphi\$$ does not admit a backwards model since most$\$\; \backslash phi,\; P\_X,\; P\_\{N\_Y\}\; \$$ don’t satisfy the above differential equation.
On a side note, a fun corollary of this theorem is that if$\$\; \backslash xi\text{'}\text{'}\text{'}\; =\; \backslash nu\text{'}\text{'}\text{'}\; =\; 0\$$ everywhere, then the only $\$\backslash phi\$\; for\; which\; a\; backward\; model\; exists\; is\; for\; linear\; \$\backslash phi\$$. This tells us that, for example, in the case where$\$\; P\_X\$\; and\; \$P\_\{N\_Y\}\$$ are Gaussian, only linear$\$\backslash phi\$$ admit backward models.
Click here if you’re interested in the proof
This proof follows pretty much exactly as in the Hoyer paper, but if you’d like to see it, you can find my write up and notes here
What this tells us about the identification problem is that if given variables $\$X,Y\$$, if we can fit an ANM $\$\backslash varphi\; :\; X\; \backslash rightarrow\; Y\$$, then with probability 1, we can assume that the causal relationship is $\$X\; \backslash rightarrow\; Y\$$. In the next section, we’ll include some experiments to show how we can use this theoretical results in some practical cases with both with real data and synthetic examples. Then, we’ll move to studying the SSL problem under these conditions.
For some further reading for anyone interested, Zhang and Hyvärinen showed in their 2010 AUI paper similar results for postlinear ANM models, which apply a further nonlinear transformation to the ANM we used. The proof techniques used there are quite similar to the proof technique used here!
Experiments
So how do we put this into practice? We’ll work through this by working backwards. What we saw was that if we can fit an ANM to data $\$X,Y\$$, then we can say with probability one that the correct causal relationship is $\$\; X\; \backslash rightarrow\; Y\; \$$, unless, of course, the model is linear in $\$Y\$$, which is the corollary mentioned after theorem 1. Here is the gameplan:
Step 1: We’ll set up our data X and Y assuming they are not statistically independent. Of course, in practice, we would need to test this, but for the sake of explanation here, we will assume this.
Step 2: We’ll then perform a regression of $\$Y\$\; onto\; \$X\$$. More specifically, we’ll split the data first into training data for our model and testing data; so really, training a regression model to regress$\$Y\_\{\backslash text\{train\}\}\$\; onto\; \$X\_\{\backslash text\{train\}\}\$$ and then make our predictions on the testing set $\$\backslash hat\{\backslash phi\}\; (X\_\{\backslash text\{test\}\})\$$.
Step 3: We’ll finally compute the estimates of the residuals $\$\backslash hat\{N\}\; =\; Y\_\{\backslash text\{test\}\}\; \; \backslash hat\{\backslash phi\}\; (X\_\{\backslash text\{test\}\})\$$.
Step 4: Then, we test whether$\$\backslash hat\{N\}\$$ is statistically independent from $\$X\_\{\backslash text\{test\}\}\$$. If it is, then the model$\$Y\; =\; \backslash phi(X)\; +\; N\_X\$$ is consistent with the data and we accept the ANM otherwise we reject it.
Step 5: To see if we have a backward model as well, we’ll do the same the same process to see if the model$\$X\; =\; \backslash psi(Y)\; +\; N\_Y\$$ is consistent with the data. If so, we’ll accept the reverse model; if not, we reject it.
Once we do this, if neither model is accepted, then we failed to fit an ANM either way and we can’t make any causal inferences from the experiment. If one model is accepted and the other rejected, we have evidence to support the respective causal direction  forward or reverse. If both models are accepted, then either causal direction is valid.
A couple of technical details: the regression that I am going to use is sklearn’s Support Vector Regression (SVR). One can use more advanced regression methods, but I found that this is sufficient for our purposes. To test independence, I will be using the HSIC from the work of Gretton et al. in their 2007 NeurIPS Paper  a test that is now very standard in applications. All code can be found on my github.
Experiment 1  Synthetic Data
In our first experiment, we’ll recreate some of Hoyer et al’s results and use synthetic data  a good place to start with these sorts of experiments.
We sample$\$X\$\; and\; \$N\$$ from a standard normal Gaussian distribution and take their absolute values to the power$\$q\$$  a parameter to measure how Gaussian the distributions are.$\$q=1\$$ gives us a usual Gaussian distribution;$\$q>1\$$ gives us a superGaussian distribution;$\$q<1\$$ gives us a subGaussian distribution. We’ll then take$\$Y\; =\; X\; +\; bX^3\; +\; N\$$ where$\$b\$$ is a parameter that controls the amount of linearity. Notice that$\$\$b\; =\; 0\; \backslash implies\; Y\; =\; X\; +\; N\$\$$ and so$\$Y\$$ is linear in that case.
For $\$q=1\$$, we take 100 values of$\$b\$$ between$\$1\$$ and$\$1\$$ and took the average of 50 experiments in each case. Figure 4 shows the probability of the forward and reverse models being accepted. As we would expect, the probability of the reverse model being accepted is significant only around$\$b=0\$$ where the model is linear and essentially$\$\; 0\; \$$ otherwise, whereas the forward (correct) model is usually accepted.
For $\$b=0\$$, i.e. the linear case, we similarly take 100 values of$\$q\$$ between$\$\; 0.5\; \$\; and\; \$\; 2\; \$$ and plot the average of 50 experiments per value of$\$q\$$ in figure 5. The reverse model is accepted around$\$\; q=1\; \$$ as seen before and rejected for the superGaussian and subGaussian cases. So the model can solve the linear subGaussian and superGaussian cases.
Experiment 2  Real Data
“Great! We can find causal relationships in data where we build those relationships in. Now what?”
I’m glad you asked! Let’s shift our focus to a real dataset. For the purposes of illustration the method in practice, we’ll use the UCI abalone dataset. Abalone are snails and it’s tricky to determine their age. To do this, one needs to cut the shell through the cone, stain it, and count the number of rings through a microscope. The number of rings is closely related to its age and gives us our best estimate, but, as UCI puts it, “[it’s a] boring and time consuming task.” A standard machine learning problem is to use easier to obtain measurements, such as the length of its shell among other things, to predict its age. Our feature,$\$X\$$ is the length of the shell and the label we want to predict,$\$Y\$$ is its age. This is a fine task, but in the spirit of causal learning, let’s ask instead: what is the causal relationship between$\$X\$$ and $\$Y\$$? Does age cause the length of the shell or does the length cause age? Here, it’s obvious that it’s the former  we expect that length causes age. This is good, because we can use our theory to verify this hypothesis.
After running the same experiments as in experiment 1, we get the following regressions and residuals.
The HSIC score for the testing data for the forward model$\$X\; \backslash rightarrow\; Y\$$ was$\$0.6\$$ with the threshold being $\$\; 0.3\$$. So the forward model was rejected. The HSIC score for the reverse model$\$Y\; \backslash rightarrow\; X\$$ was$\$\; 0.2\; \$$ with a similar threshold and so the reverse model was accepted. Thus we have seen that this is a case of anticausal learning and, in fact, it’s the age of the snail that causes the length of the shell and not the other way around.
These are simple examples that serve as evidence to the interesting claim that one can use machine learning techniques to find causal relationships in data. This method does generalize to higher numbers of variables and so in future sections, we’ll see some more interesting examples of this. Stay tuned!