Causal Inference

This is a blog for my project on the applications of causal inference to machine learning.


What is causal inference and why do I care?

What do we really mean when we say “the economy does better when my favorite political party is in power” or what do we really mean when we say “unvaccinated adults are 3-5 times more likely to get omicron infection”. A very natural property of interest in any dataset is the pattern of correlation - when two events are statistically related. Correlation is an important part of statistics, but what’s more important is how we tend to interpret it. Correlations tell us that two events tend to share some pattern and not that one event is the reason for the other occuring. We really want to say that “the economy is better because of my favorite political party, and not just they happen to occur together. Establishing causal relationships between phenomenon is the study of causal inference. You might have heard the phrase ‘correlation is not the same as causation’. This is true. While it’s true that both the number of tech companies and the number of teddy bears sold has gone up over the last few decades, one does not necessarily cause the other (perhaps both are caused by, among lots of other things, an increase in population sizes leading to more children buying teddy bears and a bigger scientific community built over the years). But this is an example of events that are correlated - they both values have gone up together - but probably not causally related.

I’m interested in causal inference, at the highest level, because it is, in some sense, the most natural question to ask when studying complex datasets. Questions about causation are often the most interesting and, so, inherently the most enlightening in any survey. We know when the economy rises and falls but it’s establishing its cause thats the most interesting. Mathematically, this presents much richer questions: how does one quantitatively establish causation between variables? how is causation structurally different from phenomenon to phenomenon? and how do we quantify that difference? These questions have a lot of implication on not only our understanding of economic and natural phenomenon but also about the human mind and our understanding of how we learn things and form such associations and establish legitimate and unfound causal relationships. I’m interested in these sorts of questions and am looking to explore these in greater depth because I’m fascinated by the question of causation, the mathematical bases on which it can be established, and the understanding of human behavior and learning patterns that this brings.


What question am I interested in?

I’m interested in the applications of causal inference to machine learning (ML). The main aim for any ML problem is to predict a label y from a feature vector x by training an algorithm on some known features. Different approaches require different training data: supervised learning refers to the method of training algorithms using feature vectors with known labels, while semisupervised learning (SSL) refers to the method of training algorithms with some data with known labels and some with unknown labels. The underlying assumption in most ML problems is that we are trying to find associations or correlations between the labels and features to make predictions; we normally don’t care about their causal structure. It doesn’t seem to matter if x causes y or vice versa since algorithms are expected to perform the same. However, this has shown to not necessarily be true. Scholköpf et al. showed in their 2012 ICML paper that there is in fact a difference in performance in these two cases. SSL algorithms performed better on anticausal problems (i.e. where x causes y ) than in causal problems (vice versa). Figure 1 below illustrates this.

The benefit of SSL depends on the causal structure. Each column of points corresponds to a benchmark data set from the UCI repository and shows the performance of six different base classifiers augmented with self-training, a generic method for SSL. Performance is measured by percentage decrease of error relative to the base classifier, that is, (error(base)−error(self-train))/error(base). Self-training overall does not help for the causal data sets, but it does help for some of the anticausal confounded data sets [from [Scholköpf et al. 2012](https://icml.cc/2012/papers/625.pdf)].
The benefit of SSL depends on the causal structure. Each column of points corresponds to a benchmark data set from the UCI repository and shows the performance of six different base classifiers augmented with self-training, a generic method for SSL. Performance is measured by percentage decrease of error relative to the base classifier, that is, (error(base)−error(self-train))/error(base). Self-training overall does not help for the causal data sets, but it does help for some of the anticausal confounded data sets [from Scholköpf et al. 2012].

Here we see that the performance of SSL on anticausal problems and causal is considerably different to supervised learning. I’m interested in looking at why this is the case and what kinds of problems do better with SSL vs supervised learning.

A toy example where we can see this is in the case of two variables

$$X = N_1$$ $$Y = N_1 + X$$

Where $N_1, N_2 \overset{\mathrm{iid}}{\sim} Be(\frac{1}{2})$. Here there is a clear causal dependency $X \rightarrow Y$.

We can factor the joint probability distribution as

$$P(X,Y) = P(X) P(Y|X) = P(Y) P(X|Y)$$

This factorization is not symmetric since we know that with the current variables, we get the probabilities

$$P(X = x) = 1/2 \text{ for } x=0,1$$$$P(Y=y) = \begin{cases} \frac{1}{4} & \text{if }y=0,2 \newline \frac{1}{2} & \text{if }y=1\end{cases}$$$$P(X=x | Y = 0) = \begin{cases} 1 & \text{if }x=0 \newline 0 & \text{if }x=1\end{cases}$$ $$P(Y=y | X = 0) = \begin{cases} \frac{1}{2} & \text{if }y=0 \newline \frac{1}{2} & \text{if }y=1 \newline 0 & \text{if }y=2\end{cases}$$

In ML, we’re usually trying to understand the distribution $ P(Y|X)$ , i.e. given the training data, what is the probability of having any given label. But notice that when we change the distribution of X to, say, $Be(\frac{2}{3})$, then we get the probabilities

$$P(X=x) = \begin{cases} \frac{1}{3} & \text{if }x = 0 \newline \frac{2}{3} & \text{if }x=1\end{cases}$$$$P(Y=y) = \begin{cases} \frac{1}{4} & \text{if }y=0,2 \newline \frac{1}{2} & \text{if }y=1\end{cases}$$$$P(X=x | Y = 0) = \begin{cases} 1 & \text{if }x=0 \newline 0 & \text{if }x=1\end{cases}$$ $$P(Y=y | X = 0) = \begin{cases} \frac{1}{3} & \text{if }y=0 \newline \frac{2}{3} & \text{if }y=1 \newline 0 & \text{if }y=2\end{cases}$$

Notice that the distribution for $ P(X|Y)$ does not change but the distribution for $ P(Y|X)$ does. This is a two-variable example to show the assymetry in causal vs anticausal problems and why certain machine learning algorithms might learn $ P(Y|X)$ differently in these cases. The goal for this project is to study this phenomenon. Specifically, I’m interested in the following questions:

  1. Can we prove that SSL performs better in anticausal learning problems than other techniques, even for specific cases?

  2. Is there an example of a dataset that is traditionally studied with ML techniques agnostic of causal structure that might be better suited to SSL techniques due to an underlying causal structure?


Identifiability

To understand this phenomenon better, first, let’s focus on the data and go back to our original motivation: discovering causal structure and finding the causal relationships between different variables. Given featuresX and labels Y, can we distinguish between the causal directions$X \rightarrow Y$ and $Y \rightarrow X$? That is to say, can we ascribe a deterministic mechanism $X \xrightarrow{\varphi} Y$ as apposed to $Y \xrightarrow{\psi} X$ .

A basic assumption that we will make is that each variable is affected by (independent) noise variables $N_Y$ (and $N_X$ for the reverse causal structure). So, the model we will work with is the graph in figure 2.

Causal Graph from X to Y
Causal Graph from X to Y

This is a basic structural causal model (SCM) - a model which includes a graph with one vertex per variable, jointly independent noise variables, and a deterministic function for each vertex that depends on the vertex’s parents and noise. This tells us we have a deterministic mechanism $\varphi$ that gives us $Y$ as $Y = \varphi(X, N_Y)$. The question of identifying graphs in causal learning then becomes a question of whether we can ascribe another mechanism from$Y$ to $X$ as shown in figure 3.

Causal Graph from Y to X
Causal Graph from Y to X

It turns out that without any restrictions on the mechanism, the problem of identifying the graph is not feasible. Given variables$X,Y$ along with their probability distribution $P(X,Y)$, you can always fit a mechanism $\varphi$ for either $X \rightarrow Y$ or $Y \rightarrow X$ (the noise variables being understood to be embedded in the graph). This is in the following proposition which states:

Proposition 1. Given two random variables $X$ and $Y$, there is a (measurable) function$f$ and random variable $ N_Y $ such that $Y = f(X, N_Y)$

Click to see proof

This isn’t the main proof of this section, so I’ll refer you to this paper by Zhang et al. for a nice proof. See Lemma 1.

Without any conditions, the indentification problem is not feasible. This should intuitively feel good - without any assumptions, we have no control over how sensitive the mechanism is to the noise and hence the complexity of $P(Y|X)$.

The question thus becomes: Under what conditions can we only ascribe one causal mechanism between the two variables and not the other (i.e. identify the causal graphs). One possibility is that we make add some structure of the model. So, in our work, we will make the following assumptions. The first will be an assumption about our model and the second will be general SCM assumptions:

  1. The mechanism is an Additive Noise Model (ANM), i.e.$$\varphi (X, N_Y) = \phi(X) + N_Y$$ for some function $\phi$, where $N_Y$ and X are independent.

  2. The usual assumptions of SCMs also hold: the independence of the mechanism and input ($P(Y | X)$ and $X$), Markov compatibility ($P(Y)$ just depends on it’s parent $X$ - which is trivial in our case), and causal sufficiency.

The first assumption is our main one. Hoyer et al. showed in their 2009 NeurIPS paper that under the ANM assumption, we can recover the causal mechanism in most cases. Before we state the main theorem, it is worth noting that since$X$ and $N_Y$ are independent, the joint distribution $ P(X,Y) $ can be rewritten as

$$ P(X = x, Y = y) = P(X = x, N_y = y - \phi(x))$$ $$= P_{X}(X = x) \cdot P_{N_Y}(N_Y = y - \phi(x)) $$

This equivalent restatement will make the theorem easier to state:

Theorem 1. Given $ X, N_Y$ , and $Y$ that satisfy an ANM with a function $ \phi$, if there is a backward mechanism of the same form, then $\phi, P_X, P_{N_Y}$ must satisfy the following differential equation:

$$ \xi''' = \xi'' \left( -\frac{\nu''' \phi'}{\nu''} + \frac{\phi''}{\phi'} \right) - 2 \nu '' \phi'' \phi' + \nu' \phi''' + \frac{\nu' \nu''' \phi'' \phi'}{\nu''} - \frac{\nu' (\phi'')^2}{\phi'}$$

where $\nu := \operatorname{log} P_{N_Y}$ and $\xi := \operatorname{log}P_X$, and we also have that $\nu''(y-\phi(x))\phi'(x) \neq 0$ .

Also, we have that if these conditions hold, then if there is a$y$ for which $\nu''(y-\phi(x))\phi'(x) \neq 0$ is true for every $ x $ aside from a countable set, then the set of all $ P_X $ which admit a backward model is 3-dimensional (i.e. can be contained in a 3-dimensional affine space).

Of course, here we are assuming that all the relevant functions are thrice differentiable.

Click here if you’re interested in the proof

You can find my write up and notes on the proof here. The general idea of the proof is that just as we can factor the joint distribution $P_{X,Y}$ into a product of distributions $P_X \cdot P_{N_Y}$, if a backword model exists, we could similarly factor it into $P_Y \cdot P_{N_X}$. Taking second order partial derivatives of (the logs of) both of these factorizations give different expressions which we set to be equal since they’re the derivatives of the same distribution. Setting the derivatives equal and rearranging terms gives us part one. Part two follows from the theory of linear differential equations.

The differential equation seems quite arbitrary, but let’s understand at a high level what this tells us. The differential equation is of order 3 (since the highest derivative taken is a third derivative). We can also see that it is linear in $\xi$; that is, it can be written as

$$a_0 + a_2 \xi'' + a_3 \xi ''' = b(x,y)$$

and the solution space must be 3-dimensional (see proof for details). Given that the dimension of the space of every possible solution$\xi$ is infinite dimensional, this tells us that almost every ANM$\varphi$ does not admit a backwards model since most $ \phi, P_X, P_{N_Y} $ don’t satisfy the above differential equation.

On a side note, a fun corollary of this theorem is that if $ \xi''' = \nu''' = 0$ everywhere, then the only $\phi$ for which a backward model exists is for linear $\phi$. This tells us that, for example, in the case where $ P_X$ and $P_{N_Y}$ are Gaussian, only linear$\phi$ admit backward models.

Click here if you’re interested in the proof

This proof follows pretty much exactly as in the Hoyer paper, but if you’d like to see it, you can find my write up and notes here

What this tells us about the identification problem is that if given variables $X,Y$, if we can fit an ANM $\varphi : X \rightarrow Y$, then with probability 1, we can assume that the causal relationship is $X \rightarrow Y$. In the next section, we’ll include some experiments to show how we can use this theoretical results in some practical cases with both with real data and synthetic examples. Then, we’ll move to studying the SSL problem under these conditions.

For some further reading for anyone interested, Zhang and Hyvärinen showed in their 2010 AUI paper similar results for postlinear ANM models, which apply a further nonlinear transformation to the ANM we used. The proof techniques used there are quite similar to the proof technique used here!


Experiments

So how do we put this into practice? We’ll work through this by working backwards. What we saw was that if we can fit an ANM to data $X,Y$, then we can say with probability one that the correct causal relationship is $ X \rightarrow Y $, unless, of course, the model is linear in $Y$ , which is the corollary mentioned after theorem 1. Here is the gameplan:

Step 1: We’ll set up our data X and Y assuming they are not statistically independent. Of course, in practice, we would need to test this, but for the sake of explanation here, we will assume this.

Step 2: We’ll then perform a regression of $Y$ onto $X$ . More specifically, we’ll split the data first into training data for our model and testing data; so really, training a regression model to regress $Y_{\text{train}}$ onto $X_{\text{train}}$ and then make our predictions on the testing set $\hat{\phi} (X_{\text{test}})$.

Step 3: We’ll finally compute the estimates of the residuals $\hat{N} = Y_{\text{test}} - \hat{\phi} (X_{\text{test}})$.

Step 4: Then, we test whether$\hat{N}$ is statistically independent from $X_{\text{test}}$. If it is, then the model$Y = \phi(X) + N_X$ is consistent with the data and we accept the ANM otherwise we reject it.

Step 5: To see if we have a backward model as well, we’ll do the same the same process to see if the model$X = \psi(Y) + N_Y$ is consistent with the data. If so, we’ll accept the reverse model; if not, we reject it.

Once we do this, if neither model is accepted, then we failed to fit an ANM either way and we can’t make any causal inferences from the experiment. If one model is accepted and the other rejected, we have evidence to support the respective causal direction - forward or reverse. If both models are accepted, then either causal direction is valid.

A couple of technical details: the regression that I am going to use is sklearn’s Support Vector Regression (SVR). One can use more advanced regression methods, but I found that this is sufficient for our purposes. To test independence, I will be using the HSIC from the work of Gretton et al. in their 2007 NeurIPS Paper - a test that is now very standard in applications. All code can be found on my github.

Experiment 1 - Synthetic Data

In our first experiment, we’ll recreate some of Hoyer et al’s results and use synthetic data - a good place to start with these sorts of experiments.

We sample$X$ and $N$ from a standard normal Gaussian distribution and take their absolute values to the power $q$ - a parameter to measure how Gaussian the distributions are. $q=1$ gives us a usual Gaussian distribution; $q>1$ gives us a super-Gaussian distribution; $q<1$ gives us a sub-Gaussian distribution. We’ll then take$Y = X + bX^3 + N$ where $b$ is a parameter that controls the amount of linearity. Notice that $$b = 0 \implies Y = X + N$$ and so $Y$ is linear in that case.

Probability of acceptance vs varying values of b with fixed q = 1
Probability of acceptance vs varying values of b with fixed q = 1

For $q=1$, we take 100 values of$b$ between$-1$ and$1$ and took the average of 50 experiments in each case. Figure 4 shows the probability of the forward and reverse models being accepted. As we would expect, the probability of the reverse model being accepted is significant only around $b=0$ where the model is linear and essentially $ 0 $ otherwise, whereas the forward (correct) model is usually accepted.

Probability of acceptance vs varying values of q with fixed b = 0
Probability of acceptance vs varying values of q with fixed b = 0

For $b=0$, i.e. the linear case, we similarly take 100 values of$q$ between $ 0.5 $ and $ 2 $ and plot the average of 50 experiments per value of$q$ in figure 5. The reverse model is accepted around $ q=1 $ as seen before and rejected for the super-Gaussian and sub-Gaussian cases. So the model can solve the linear sub-Gaussian and super-Gaussian cases.

Experiment 2 - Real Data

“Great! We can find causal relationships in data where we build those relationships in. Now what?”

I’m glad you asked! Let’s shift our focus to a real dataset. For the purposes of illustration the method in practice, we’ll use the UCI abalone dataset. Abalone are snails and it’s tricky to determine their age. To do this, one needs to cut the shell through the cone, stain it, and count the number of rings through a microscope. The number of rings is closely related to its age and gives us our best estimate, but, as UCI puts it, “[it’s a] boring and time consuming task.” A standard machine learning problem is to use easier to obtain measurements, such as the length of its shell among other things, to predict its age. Our feature, $X$ is the length of the shell and the label we want to predict, $Y$ is its age. This is a fine task, but in the spirit of causal learning, let’s ask instead: what is the causal relationship between $X$ and $Y$ ? Does age cause the length of the shell or does the length cause age? Here, it’s obvious that it’s the former - we expect that length causes age. This is good, because we can use our theory to verify this hypothesis.

After running the same experiments as in experiment 1, we get the following regressions and residuals.

SVR regression of length onto age
SVR regression of length onto age
Residuals of length-to-age regression
Residuals of length-to-age regression
SVR regression of age onto length
SVR regression of age onto length
Residuals of age-to-length regression
Residuals of age-to-length regression

The HSIC score for the testing data for the forward model $X \rightarrow Y$ was$0.6$ with the threshold being $ 0.3$. So the forward model was rejected. The HSIC score for the reverse model $Y \rightarrow X$ was $ 0.2 $ with a similar threshold and so the reverse model was accepted. Thus we have seen that this is a case of anticausal learning and, in fact, it’s the age of the snail that causes the length of the shell and not the other way around.

These are simple examples that serve as evidence to the interesting claim that one can use machine learning techniques to find causal relationships in data. This method does generalize to higher numbers of variables and so in future sections, we’ll see some more interesting examples of this. Stay tuned!

Karan Srivastava
Karan Srivastava
PhD Student, Mathematics

My research interests include algebraic geometry, number theory, and machine learning