Mixing & Matching in R for Data Science

I’ve spent time over the last few months attempting to enhance my skills in the statistical sub-field of causal inference.

Simplified considerably, causal inference comprises a set of methods and techniques that help analysts make the jump from association or correlation to cause and effect. How can one progress from noting a correlation between factors X and Y to making a sound conclusion that X causes Y? For example, the association between X and Y could arise from X causing Y, Y causing X, or both X and Y being influenced by a third factor Z. The field of causal inference aims to sort out these possibilities.
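To make the third-factor case concrete, here is a minimal simulation sketch in base R. The variable names, sample size, and coefficients are illustrative assumptions, not anything from the ACS data: Z drives both X and Y, while X has no effect on Y at all.

```r
# Simulated illustration only: Z is a common cause of X and Y.
set.seed(42)
n <- 10000
z <- rnorm(n)               # the confounder
x <- 0.8 * z + rnorm(n)     # X depends on Z only
y <- 1.5 * z + rnorm(n)     # Y depends on Z only; X never enters

# X and Y are correlated even though neither causes the other.
raw_cor <- cor(x, y)

# Regressing Y on X alone picks up the spurious association;
# adding Z to the model should push the X coefficient toward zero.
naive_slope    <- unname(coef(lm(y ~ x))["x"])
adjusted_slope <- unname(coef(lm(y ~ x + z))["x"])

print(c(raw_cor = raw_cor,
        naive_slope = naive_slope,
        adjusted_slope = adjusted_slope))
```

The adjustment works here only because Z is observed and the model is correctly specified, which is rarely guaranteed with real observational data.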

The platinum design for causal inference is the experiment where subjects are randomly assigned to the different treatment groups. With randomization, the effects of uncontrolled or confounding factors (the Z’s) should, within sampling limitations, be “equal” or “balanced” across treatments or values of X. In such settings of “controlled” Z’s, the analyst is much more confident that a correlation between X and Y actually indicates causality.
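The balancing effect of randomization can be sketched with a small base R simulation. Everything here is an assumption made for illustration: a single confounder resembling age, a coin-flip assignment for the randomized case, and an assignment probability that rises with the confounder for the observational case.

```r
# Simulated illustration only: compare confounder balance under
# random assignment versus assignment that depends on the confounder.
set.seed(1)
n <- 20000
z <- rnorm(n, mean = 40, sd = 10)   # a confounder such as age

# Randomized experiment: treatment is a coin flip, independent of z.
treat_rand <- rbinom(n, 1, 0.5)

# Observational setting: higher z makes treatment more likely.
treat_obs <- rbinom(n, 1, plogis((z - 40) / 10))

# Difference in mean z between treated and untreated units.
gap <- function(treat, z) mean(z[treat == 1]) - mean(z[treat == 0])

gap_rand <- gap(treat_rand, z)
gap_obs  <- gap(treat_obs, z)

print(c(gap_rand = gap_rand, gap_obs = gap_obs))
```

Under the coin flip, the gap in mean z should be close to zero up to sampling noise, while the observational assignment leaves the treated group noticeably higher on z.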

But what about the in-situ data gathering schemes generally seen in the data science world, where data are observational, and confounders are free to roam? What is one to do? The answer: consider causal inference techniques that attempt to statistically mimic the randomized experiment.

Years ago, I routinely deployed analysis of covariance methods that included confounders directly in the linear models. If the X’s showed stronger coefficients than the confounders, I’d conclude that X caused Y. Not a very compelling solution. Over the years, I’ve tried propensity scores, which attempt to summarize the confounders into a single score. Then there are matching techniques like those used below. The idea behind matching is to assemble a control group that looks similar to the treatment group on the distribution of confounders. If successful, the confounders are balanced and other things are essentially equal, so a strong residual relationship between X and Y is evidence of cause and effect. That is, provided the critical ignorability assumption holds: that all important confounders have been identified a priori.
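The propensity score and matching ideas can be sketched by hand in base R. This is a simplified stand-in written for illustration, not the procedure any particular package uses: the data are simulated, the column names (treat, age, female) are hypothetical, and the greedy loop simply pairs each treated unit with the closest remaining control on the estimated propensity score.

```r
# Simulated illustration only: hand-rolled 1:1 greedy matching on a
# propensity score, without replacement.
set.seed(7)
n <- 2000
age    <- round(runif(n, 25, 65))
female <- rbinom(n, 1, 0.5)
treat  <- rbinom(n, 1, plogis(-2.5 + 0.04 * age + 0.3 * female))
df <- data.frame(treat, age, female)

# Propensity score: estimated probability of treatment given confounders.
ps_fit <- glm(treat ~ age + female, data = df, family = binomial())
df$ps <- fitted(ps_fit)

treated  <- which(df$treat == 1)
controls <- which(df$treat == 0)
stopifnot(length(controls) >= length(treated))

# For each treated unit, take the nearest unused control.
matched_control <- integer(length(treated))
for (i in seq_along(treated)) {
  j <- which.min(abs(df$ps[controls] - df$ps[treated[i]]))
  matched_control[i] <- controls[j]
  controls <- controls[-j]   # drop it so it cannot be reused
}

# Standardized mean difference, a common balance diagnostic.
smd <- function(x1, x0) (mean(x1) - mean(x0)) / sqrt((var(x1) + var(x0)) / 2)

smd_before <- smd(df$age[df$treat == 1], df$age[df$treat == 0])
smd_after  <- smd(df$age[treated], df$age[matched_control])

print(c(smd_before = smd_before, smd_after = smd_after))
```

If matching succeeds, the standardized difference in age should shrink substantially. Balance on observed confounders says nothing about unobserved ones, which is where the ignorability assumption comes in.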

While searching the web for CM didactics, I came across a well-assembled, intermediate-level Coursera curriculum, A Crash Course in Causality: Inferring Causal Effects from Observational Data. It turned out to be just what the statistical doctor ordered. I invested a day absorbing many of Professor Jason Roy’s accessible lectures, including a full analysis example worked through in R. I then took what I’d learned and applied it to my own data with my own R code.

What follows is an application of matching techniques to data from the American Community Survey. The question I’m asking is whether sample respondents with a terminal master’s degree earn more than those with a terminal bachelor’s. The treatment is thus MA, the control BA. The confounders considered are age (MA respondents are older), sex (more female MA respondents), and marital status (MA respondents are more likely to be married), among others.

I first subset all terminal bachelor’s and master’s respondents into a data set and take a random sample of 250,000 of these cases. I then match treatment (has a terminal MA) to control (has a terminal BA), invoking the R MatchIt package for dense, greedy, nearest neighbor matching on the identified confounders. This process ultimately produces an R data.table with a single matched control for each treatment record. It turns out that MA respondents do indeed have considerably higher average incomes than BAs. But when matched on the confounders, the mean annual income difference is reduced by over $2,500. This makes sense since, among other things, the MA sample is several years older on average than the BA sample.
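A workflow of this shape might look roughly like the sketch below. It is not the code behind the results above: the data are simulated rather than drawn from the ACS, the column names (ma, age, female, married, income) and all coefficients are hypothetical, and the sample is far smaller than 250,000. The only package calls used are matchit() and match.data() from MatchIt, with the default propensity score distance.

```r
# Simulated illustration only; requires the MatchIt package.
library(MatchIt)

set.seed(2024)
n <- 5000
acs <- data.frame(
  age     = round(runif(n, 25, 65)),
  female  = rbinom(n, 1, 0.5),
  married = rbinom(n, 1, 0.55)
)
# Hypothetical treatment: 1 = terminal MA, 0 = terminal BA.
acs$ma <- rbinom(n, 1, plogis(-2.5 + 0.03 * acs$age +
                              0.2 * acs$female + 0.2 * acs$married))
# Hypothetical income that depends on both age and degree.
acs$income <- 30000 + 900 * acs$age + 8000 * acs$ma + rnorm(n, sd = 15000)

# Greedy 1:1 nearest neighbor matching on the estimated propensity score.
m_out <- matchit(ma ~ age + female + married, data = acs,
                 method = "nearest", ratio = 1)

# One matched control per treated record, returned as a data frame.
matched <- match.data(m_out)

raw_gap     <- with(acs,     mean(income[ma == 1]) - mean(income[ma == 0]))
matched_gap <- with(matched, mean(income[ma == 1]) - mean(income[ma == 0]))

print(c(raw_gap = raw_gap, matched_gap = matched_gap))
```

Because older respondents are both more likely to hold the MA and more likely to earn more in this simulation, the raw income gap overstates the built-in degree effect, and matching on the confounders should pull the gap down. Calling summary(m_out) reports covariate balance before and after matching.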

The technology used is JupyterLab with Microsoft R Open 3.4.4. For the matching work, the MatchIt and tableone packages are deployed.
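As a rough sketch of where tableone fits, the package can tabulate confounders by treatment group and report standardized mean differences as a balance check. The data below are simulated and the column names are hypothetical; only CreateTableOne(), its print method, and ExtractSmd() are taken from the package.

```r
# Simulated illustration only; requires the tableone package.
library(tableone)

set.seed(99)
n <- 3000
dat <- data.frame(
  age     = round(runif(n, 25, 65)),
  female  = rbinom(n, 1, 0.5),
  married = rbinom(n, 1, 0.55)
)
# Hypothetical treatment indicator: 1 = terminal MA, 0 = terminal BA.
dat$ma <- rbinom(n, 1, plogis(-2.5 + 0.03 * dat$age +
                              0.2 * dat$female + 0.2 * dat$married))

covariates <- c("age", "female", "married")

# Summarize each confounder within the MA and BA groups.
tab <- CreateTableOne(vars = covariates, strata = "ma", data = dat,
                      factorVars = c("female", "married"), test = FALSE)

# smd = TRUE adds standardized mean differences to the printed table.
print(tab, smd = TRUE)

# The same differences as numbers, for programmatic checks.
smd <- ExtractSmd(tab)
```

Running the same table on the unmatched and then the matched data gives a before-and-after view of balance. A common rule of thumb treats standardized differences below 0.1 as adequately balanced.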
