Potential outcomes

The potential outcome formalism of Donald Rubin and Jerzy Neyman is a key development in modern causal inference. One of our textbooks has a pretty good list of scholarly references. See also this blog post of Gelman’s for some interesting discussion about the intellectual history of potential outcomes in economics.

The assigned paper by Holland defines the fundamental problem of causal inference: we never get to observe the counterfactual reality. We only see the state of events that happened, not the alternative that did not. The same person at the same time cannot both take and not take a drug. The potential outcome framework gives this idea explicit notation, which makes it quite a lot easier to think straight about and even permits some calculation.

Begin by considering, for a given individual indexed by subscript i, a response/outcome variable y_i. (For concreteness, you can think about blood pressure or something.) Now let’s express this observed outcome in terms of two hypothetical, or “potential,” outcomes, which we denote y_i^1 and y_i^0 for “did” and “did not” receive some drug, respectively. Then our observed outcome can be written as y_i = D_i y_i^1 + (1-D_i)y_i^0, where D_i is the observed indicator of whether the individual actually (as opposed to potentially or hypothetically) took the drug. So in a causal inference scenario we consider a triple (y_i^0, y_i^1, D_i), of which we only get to see two; which of (y_i^0, y_i^1) we see depends on D_i. The individual treatment effect can now be defined as \tau_i = y_i^1 - y_i^0.
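To make the notation concrete, here is a tiny simulated example continuing the blood-pressure theme (all numbers are made up for illustration). Because we play nature in a simulation, we generate both potential outcomes, but the observed data reveal only one per individual:

```python
import numpy as np

rng = np.random.default_rng(0)

N = 5
y0 = rng.normal(140.0, 10.0, size=N)           # potential blood pressure without the drug
y1 = y0 - 8.0 + rng.normal(0.0, 2.0, size=N)   # potential blood pressure with the drug
d = rng.integers(0, 2, size=N)                 # observed treatment indicator D_i

# Observed outcome: y_i = D_i * y_i^1 + (1 - D_i) * y_i^0
y_obs = d * y1 + (1 - d) * y0

# The fundamental problem: for each unit we observe one potential outcome, never both.
for i in range(N):
    arm = "treated" if d[i] == 1 else "control"
    print(f"unit {i}: {arm}, observed y = {y_obs[i]:.1f}")
```

With real data we would only ever see the `d` and `y_obs` columns; the full `(y0, y1)` pair exists only because we simulated it.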

Let’s take a look at how this setup can buy us a little extra clarity. Let us begin by considering the sample average treatment effect (SATE) in a fixed sample of N individuals:

\bar{\tau} \equiv \frac{1}{N}\sum_{i=1}^{N} (y_i^1 - y_i^0) =\frac{1}{N}\sum_{i = 1}^N \tau_i = \mbox{E}_N(\tau),

where \mbox{E}_N(\cdot) denotes the expectation over the sample.
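In a simulation where we generate both potential outcomes (something never possible with real data), the SATE is directly computable; here is a hypothetical sketch with made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(1)

N = 1_000
y0 = rng.normal(140.0, 10.0, size=N)           # potential outcomes without treatment
y1 = y0 - 8.0 + rng.normal(0.0, 2.0, size=N)   # individual effects vary around -8

tau = y1 - y0        # individual treatment effects tau_i
sate = tau.mean()    # SATE = (1/N) * sum of tau_i over the fixed sample
```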

Next, consider that nature has assigned individuals in this sample to an observed treatment condition; in other words, let us regard the D_i as independent and identically distributed draws of a random variable D . Thus, we may express \bar{\tau} in terms of the observed data by invoking the law of iterated expectation:

\begin{aligned} \bar{\tau} &= \mbox{E}(\mbox{E}_N(\tau \mid D)) = \mbox{E}(\mbox{E}_N(y^1 - y^0 \mid D)) \\ &= \pi (\mbox{E}_N(y^1 \mid D = 1) - \mbox{E}_N( y^0 \mid D =1)) + (1 - \pi)(\mbox{E}_N(y^1 \mid D = 0) - \mbox{E}_N( y^0 \mid D =0)), \end{aligned}

where \pi = \mbox{Pr}(D = 1). This expression contains a total of five terms, three of which we can readily estimate: \pi, \mbox{E}_N(y^1 \mid D = 1), and \mbox{E}_N(y^0 \mid D = 0). The other two terms, \mbox{E}_N(y^1 \mid D = 0) and \mbox{E}_N(y^0 \mid D = 1), would seemingly be inaccessible to us due to the fundamental problem.

However, this is where — perhaps counterintuitively — randomization comes to the rescue. If D \perp (y^0,y^1), as is the case under randomized treatment assignment, then \mbox{E}_N(y^1 \mid D = 1) =\mbox{E}_N(y^1 \mid D = 0) and \mbox{E}_N(y^0 \mid D = 0) =\mbox{E}_N(y^0 \mid D = 1) and \bar{\tau} can be estimated.
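As a sanity check on this claim, here is a hypothetical simulation (numbers made up, as before) in which assignment is randomized, so D \perp (y^0, y^1) holds by construction and the difference in observed group means recovers the average effect:

```python
import numpy as np

rng = np.random.default_rng(2)

N = 100_000
y0 = rng.normal(140.0, 10.0, size=N)
y1 = y0 - 8.0 + rng.normal(0.0, 2.0, size=N)
d = rng.integers(0, 2, size=N)   # randomized assignment: D independent of (y0, y1)

y_obs = np.where(d == 1, y1, y0)

# Difference in observed group means; randomization lets the observed control
# mean stand in for the unobservable counterfactual mean of the treated (and
# vice versa).
diff_in_means = y_obs[d == 1].mean() - y_obs[d == 0].mean()

sate = (y1 - y0).mean()   # computable here only because the simulation is omniscient
```

In a large randomized sample, `diff_in_means` lands close to `sate`, which is exactly the substitution of conditional means that independence licenses.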


It pays dividends in causal inference to keep clear the distinction between three related and similar-sounding nouns: estimands, estimators, and estimates. These are the “quantities we want to estimate,” the “algorithms we use to estimate them,” and the actual “output of those estimation algorithms,” respectively. In particular, there are many different average treatment effects we might care about, differing in which group of individuals we average over; for each distinct estimand we seek a distinct estimator.

Above, we considered the sample average treatment effect, or SATE. Another common estimand is the so-called population average treatment effect, or PATE. The PATE assumes that the observed sample is representative of a (at least in principle) well-defined super-population, and the expectation in the corresponding ATE is not the sample average but the population expectation. For the PATE we might write

\bar{\tau} = \mbox{E}(\tau) = \mbox{E}(Y^1 - Y^0),

dropping the N and now referencing the random variables (Y^0, Y^1). This is my preferred mode of operation, as someone comfortable modeling a wide range of phenomena probabilistically and who usually conducts inference in a Bayesian framework. The main appeals of restricting attention to the sample average treatment effect, as I see it, are a model-free ethos and the availability of elegant and intuitive permutation tests for frequentist inference. For more on this distinction, read carefully section 2.11 of the textbook.

Additionally, one may consider the average treatment effect among substantively distinct subpopulations, such as only those who actually receive treatment or only those who do not. These two ATEs are referred to as the “average treatment effect on the treated,” or ATT, and the “average treatment effect on the controls,” or ATC. Their definitions are

\bar{\tau}_{ATT} = \mbox{E}(\tau \mid D = 1) = \mbox{E}(Y^1 - Y^0 \mid D = 1),

and

\bar{\tau}_{ATC} = \mbox{E}(\tau \mid D = 0) = \mbox{E}(Y^1 - Y^0 \mid D = 0),

respectively. These two estimands may each be estimated under weaker assumptions than randomized treatment assignment.
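To see how the ATT and ATC can come apart when nature’s assignment is not random, here is a hypothetical simulation (model and numbers invented for illustration) in which sicker individuals are both more likely to take the drug and benefit more from it:

```python
import numpy as np

rng = np.random.default_rng(3)

N = 100_000
y0 = rng.normal(140.0, 10.0, size=N)
# Effect is larger (more negative) for people with higher baseline blood pressure:
y1 = y0 - 0.1 * (y0 - 140.0) - 8.0
# Sicker people (higher y0) are more likely to take the drug -- no randomization here:
p_treat = 1.0 / (1.0 + np.exp(-(y0 - 140.0) / 10.0))
d = rng.random(N) < p_treat

tau = y1 - y0
ate = tau.mean()     # E(tau)
att = tau[d].mean()  # E(tau | D = 1): effect among those who took the drug
atc = tau[~d].mean() # E(tau | D = 0): effect among those who did not
```

Here the treated benefit more than the controls would have, so ATT < ATE < ATC, and no single “average treatment effect” summarizes everyone at once.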

EXERCISE:  By manipulating the expression in the previous section, derive estimators for the ATE, ATT, and ATC, stating under what moment conditions they are feasible. In the case of randomized treatment assignment, do the three estimands differ?


The above discussion skipped over an important assumption called SUTVA (the stable unit treatment value assumption), which I will describe and write up in the next entry.


Additional textbooks worth looking into:

Causality by Judea Pearl.

Baby Pearl.

Imbens and Rubin.
