In class yesterday we reviewed a few key concepts, which I’m going to revisit here for posterity.
Also, here is a link to a nice monograph on regression discontinuity designs (RDD). Please read chapters 1 and 2 (17 pages in total) before class tomorrow.
The first agenda item is to consider in more detail the proof of Theorem 1 in Angrist and Imbens (1991). The theorem shows that the average effect of the treatment on the treated (ATT) is identified if two assumptions are satisfied. The proof is elementary, but the presentation in the working paper is extremely telegraphic. Here we fill in details. The two conditions are:
- there exists an instrumental variable $Z$ such that $Y_d \perp Z$ for all $d \in \{0, 1\}$, (exclusion restriction)
- $P(D = 1 \mid Z = 0) = 0$. (eligibility instrument)

The first condition obtains, for example, if $Z$ arises independently of any of the factors affecting the outcome (other than the treatment assignment) and prior to the treatment assignment $D$. The second condition can be interpreted as saying that $Z$ denotes eligibility to receive treatment: units with $Z = 0$ cannot be treated.
First, we must show that $E(Y \mid Z = 0) = E(Y_0)$ (a fact that is merely stated in the paper). This claim requires both assumptions. To show it, we write the observed outcome as

$$Y = Y_0 + D(Y_1 - Y_0)$$

and then take expectations conditional on $Z = 0$:

$$E(Y \mid Z = 0) = E(Y_0 \mid Z = 0) + E(D(Y_1 - Y_0) \mid Z = 0).$$

The first term on the right becomes $E(Y_0)$ by the first assumption. The second term can be written with iterated expectation as

$$E(D(Y_1 - Y_0) \mid Z = 0) = P(D = 1 \mid Z = 0)\,E(Y_1 - Y_0 \mid D = 1, Z = 0) = 0$$

by the second assumption.

Next, we consider $E(Y \mid Z = 1)$, taking the same approach as above, but with $Z = 1$ this time:

$$E(Y \mid Z = 1) = E(Y_0 \mid Z = 1) + E(D(Y_1 - Y_0) \mid Z = 1).$$

As before, the first term is $E(Y_0)$ by the first assumption. And, by the first part above, $E(Y_0)$ is equal to $E(Y \mid Z = 0)$ (an estimable quantity). Iterated expectation on the second term gives

$$E(D(Y_1 - Y_0) \mid Z = 1) = P(D = 1 \mid Z = 1)\,E(Y_1 - Y_0 \mid D = 1, Z = 1).$$

Next, we recognize that we can substitute $E(Y_1 - Y_0 \mid D = 1)$ for $E(Y_1 - Y_0 \mid D = 1, Z = 1)$ by the second assumption: restricting to situations where $D = 1$ and $Z = 1$ is exactly restricting to situations where just $D = 1$, because all of these have $Z = 1$ by assumption.

Putting all these pieces together allows us to solve for the desired treatment effect as

$$E(Y_1 - Y_0 \mid D = 1) = \frac{E(Y \mid Z = 1) - E(Y \mid Z = 0)}{P(D = 1 \mid Z = 1)}.$$
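As a sanity check on this identification formula, here is a minimal Python simulation. The data generating process (a binary instrument, selection into treatment based on gains) and all parameter values are illustrative choices, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Binary instrument: Z = 1 means eligible for treatment.
Z = rng.binomial(1, 0.5, n)

# Potential outcomes, generated independently of Z (exclusion restriction).
Y0 = rng.normal(0, 1, n)
Y1 = Y0 + 2.0 + rng.normal(0, 1, n)

# Treatment: only eligible units can be treated (eligibility instrument),
# and among the eligible, units with larger gains are more likely to take up.
D = Z * ((Y1 - Y0 + rng.normal(0, 1, n)) > 2.0).astype(int)

# Observed outcome.
Y = Y0 + D * (Y1 - Y0)

# ATT via the identification formula (everything here is estimable)...
att_hat = (Y[Z == 1].mean() - Y[Z == 0].mean()) / D[Z == 1].mean()

# ...versus the (normally unobservable) truth.
att_true = (Y1[D == 1] - Y0[D == 1]).mean()
print(att_hat, att_true)
```

Note that because take-up depends on the gain $Y_1 - Y_0$, the ATT here exceeds the average effect of 2 in the whole population — and the formula still recovers it, since nothing in the proof required treatment take-up to be independent of the potential outcomes.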
The next item is to look at identification of the treatment effect in linear instrumental variables models. For simplicity, we will consider a continuous treatment and a continuous instrument. In this case our structural equation representation is

$$D = \gamma Z + U + \varepsilon_D, \qquad Y = \beta D + U + \varepsilon_Y,$$

where $Z$, $U$, $\varepsilon_D$, and $\varepsilon_Y$ are all mutually independent. The shared dependence on $U$ is what makes direct regression of $Y$ on $D$ inappropriate for determining the treatment effect $\beta$.

However, observe that substituting the equation for $D$ into the equation for $Y$ yields

$$Y = \beta\gamma Z + (\beta + 1)U + \beta\varepsilon_D + \varepsilon_Y.$$

But here we can recognize that the "error term" $(\beta + 1)U + \beta\varepsilon_D + \varepsilon_Y$ is now independent of $Z$, meaning that regression of $Y$ on $Z$ will yield an estimate of $\beta\gamma$. A regression of $D$ on $Z$ will yield an estimate of $\gamma$, and $\beta$ can be obtained as a ratio.

Likewise, if we had $\hat{\gamma}$ in hand, we could regress $Y$ on the fitted values $\hat{D} = \hat{\gamma}Z$ to obtain an estimate of $\beta$. This approach, using a "first stage" estimate of $D$, is called "two-stage least squares" or 2SLS.
Here is an R script briefly demonstrating these ideas in action.
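In the same spirit, here is a minimal Python sketch of the demonstration, using a linear model $D = \gamma Z + U + \varepsilon_D$, $Y = \beta D + U + \varepsilon_Y$ with simulated data; the true values $\beta = 2$ and $\gamma = 1.5$ are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
beta, gamma = 2.0, 1.5  # true treatment effect and first-stage coefficient

Z = rng.normal(0, 1, n)                   # instrument
U = rng.normal(0, 1, n)                   # unobserved confounder
D = gamma * Z + U + rng.normal(0, 1, n)   # treatment
Y = beta * D + U + rng.normal(0, 1, n)    # outcome

def slope(x, y):
    # OLS slope of y on x.
    return np.cov(x, y)[0, 1] / np.var(x)

naive = slope(D, Y)                 # biased upward: shared dependence on U
ratio = slope(Z, Y) / slope(Z, D)   # (beta*gamma) / gamma = beta
D_hat = slope(Z, D) * Z             # first-stage fitted values
tsls = slope(D_hat, Y)              # second stage: also recovers beta
print(naive, ratio, tsls)
```

The naive regression coefficient is contaminated by $U$, while both the ratio and the 2SLS estimate land on the true $\beta$.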
Finally, we took a look at how to think about the do-operator, as distinct from “vanilla” conditioning. The basic insight is that conditioning is “filtering”, while “do-ing” is “short-circuiting”.
In more detail, if you think about the data generating process as an algorithm that takes stochastic inputs, conditioning proceeds by generating all of the outputs and then simply restricting the output to those realizations satisfying a particular condition (hence “conditioning”).
The do-operator, by contrast, generates all of the data, then replaces every draw of the intervened-upon variable with a single fixed value, and then re-generates any variables that depend (structurally/causally) on the variable you replaced.
Here is a script that briefly demonstrates this idea in action.
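A minimal Python version of such a demonstration, using an illustrative confounded model in which $U$ drives both $D$ and $Y$ and the true effect of $D$ on $Y$ is 1:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000

U = rng.normal(0, 1, n)                        # common cause
D = (U + rng.normal(0, 1, n) > 0).astype(int)  # treatment depends on U
Y = D + U + rng.normal(0, 1, n)                # true effect of D on Y is 1

# Conditioning = filtering: generate everything, then keep only the
# realizations satisfying D = 1 (or D = 0). U leaks in through D.
cond_diff = Y[D == 1].mean() - Y[D == 0].mean()

# do-ing = short-circuiting: overwrite every draw of D with a fixed value,
# then re-generate Y (the only variable downstream of D) from its equation.
def do(d):
    D_do = np.full(n, d)
    return D_do + U + rng.normal(0, 1, n)

do_diff = do(1).mean() - do(0).mean()
print(cond_diff, do_diff)
```

The conditioned contrast is inflated by the confounder, while the do-contrast recovers the structural effect of 1.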
