Bayesian probit regression

Regression analysis for dichotomous (binary) data usually proceeds by specifying a link function whose inverse maps a linear predictor on the real line back to the unit interval. The most common link function is the logit link, leading to logistic regression. The likelihood function in this case is simply a Bernoulli likelihood for each data point. This is the best-known case of a generalized linear model.

In a Bayesian setting, it is computationally more convenient (as I will describe) to use a probit link, whose inverse is the cumulative distribution function (CDF) of a standard normal random variable. In this class I will provide no detailed discussion of how the probit and the logit link differ, except to say that in many practical situations the distinction will be minimal (especially for the purpose of classification tasks).

The reason the probit model is so useful in a Bayesian setup is that one can consider the dichotomous data as arising via a deterministic transformation of a continuous latent variable:

$Y_i = 1(Z_i > 0).$

If we then assume that

$Z_i \sim N(X_i\beta, 1)$

we can deduce that

$P(Y_i = 1) = \Phi(X_i\beta),$

that is, a probit model.

How did we arrive at this? Note that the probability that $Z_i > 0$ is the integral from zero to positive infinity of a normal density function with variance one centered at $X_i\beta$. Shifting this density over by $-X_i\beta$ (a change of variables), we see that this area is the same as the area under a standard normal density from $-X_i\beta$ to positive infinity. But because the standard normal is symmetric about zero, this is the same as the area under the curve from negative infinity to $X_i\beta$, which is exactly $\Phi(X_i\beta)$, hence the result.
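Written as a chain of equalities, and using the fact that $Z_i - X_i\beta$ is standard normal, the argument reads:

$P(Y_i = 1) = P(Z_i > 0) = P(Z_i - X_i\beta > -X_i\beta) = 1 - \Phi(-X_i\beta) = \Phi(X_i\beta).$

If you would rather check this numerically, here is a minimal simulation sketch using only the Python standard library. The function name `simulate_probit_frequency`, the value `eta = 0.7` (standing in for $X_i\beta$), the number of draws, and the seed are all my own illustrative choices, not anything from the demo script.

```python
import random
from statistics import NormalDist


def simulate_probit_frequency(linear_predictor, n_draws=200_000, seed=0):
    """Fraction of draws with Z > 0 when Z ~ N(linear_predictor, 1)."""
    rng = random.Random(seed)
    hits = sum(rng.gauss(linear_predictor, 1.0) > 0 for _ in range(n_draws))
    return hits / n_draws


eta = 0.7  # hypothetical value of X_i * beta, chosen only for illustration
empirical = simulate_probit_frequency(eta)
theoretical = NormalDist().cdf(eta)  # Phi(eta), the standard normal CDF at eta
print(f"simulated P(Y=1) = {empirical:.3f}, Phi(eta) = {theoretical:.3f}")
```

The two printed numbers should agree up to Monte Carlo error, which shrinks as `n_draws` grows.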

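Anyway, the point of this is that performing inference for the regression coefficients of a probit model reduces to alternating between two steps: sampling the coefficients of a linear regression of the $Z_i$ on the $X_i$ (with the error variance fixed at one), and then sampling the latent variables $Z_i$ given those coefficients. The convenient thing about sampling the $Z_i$ is that, conditional on $X_i\beta$ and $Y_i$, the updates are independent across observations and are simply draws from a truncated normal distribution: a normal with mean $X_i\beta$ and variance one, restricted to be positive if $Y_i = 1$ and non-positive if $Y_i = 0$.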

Fortunately, truncated normal distributions are easy to draw from using the inverse CDF method. See this demo script.
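To make the two-step scheme concrete, here is a minimal sketch in standard-library Python. It is not the demo script, and it makes simplifying assumptions of my own: a single predictor with no intercept, and a flat prior on $\beta$, so that the update for $\beta$ given the $Z_i$ is normal with mean $\sum_i x_i z_i / \sum_i x_i^2$ and variance $1 / \sum_i x_i^2$. The names `draw_latent` and `gibbs_probit`, the simulated data, and the iteration counts are all hypothetical.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def draw_latent(mean, y, rng):
    """Draw Z ~ N(mean, 1) truncated to Z > 0 if y == 1, else Z <= 0,
    using the inverse CDF method."""
    p_zero = STD_NORMAL.cdf(-mean)  # P(Z <= 0) before truncation
    lo, hi = (p_zero, 1.0) if y == 1 else (0.0, p_zero)
    u = lo + rng.random() * (hi - lo)  # uniform on the allowed CDF range
    # Crude guard so inv_cdf never sees exactly 0 or 1; far in the tails
    # a more careful method would be needed.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return mean + STD_NORMAL.inv_cdf(u)


def gibbs_probit(x, y, n_iter=1500, burn_in=300, seed=0):
    """Sampler for a one-coefficient probit model (assumes a flat prior on beta)."""
    rng = random.Random(seed)
    sum_xx = sum(xi * xi for xi in x)
    beta = 0.0
    draws = []
    for it in range(n_iter):
        # Step 1: latent variables given beta, independently across observations.
        z = [draw_latent(xi * beta, yi, rng) for xi, yi in zip(x, y)]
        # Step 2: beta given z is a linear regression with known unit variance.
        beta_hat = sum(xi * zi for xi, zi in zip(x, z)) / sum_xx
        beta = rng.gauss(beta_hat, 1.0 / math.sqrt(sum_xx))
        if it >= burn_in:
            draws.append(beta)
    return draws


# Simulated data with a known coefficient, purely for illustration.
data_rng = random.Random(1)
true_beta = 1.0
x = [data_rng.gauss(0.0, 1.0) for _ in range(200)]
y = [1 if data_rng.gauss(xi * true_beta, 1.0) > 0 else 0 for xi in x]

draws = gibbs_probit(x, y)
posterior_mean = sum(draws) / len(draws)
print(f"posterior mean of beta: {posterior_mean:.2f}")
```

With several predictors, the scalar update in step 2 would become a multivariate normal draw, and an informative prior would change its mean and variance; the latent-variable step would stay the same.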

For the original paper presenting this idea, see Albert and Chib (1993).
