Benchmarking Estimators for Natural Experiments: A Novel Dataset and a Doubly Robust Algorithm

R. Teal Witter, Christopher Musco

Abstract

Estimating the effect of treatments from natural experiments, where treatments are pre-assigned, is an important and well-studied problem. We introduce a novel natural experiment dataset obtained from an early childhood literacy nonprofit. Surprisingly, applying over 20 established estimators to the dataset produces inconsistent results in evaluating the nonprofit's efficacy. To address this, we create a benchmark to evaluate estimator accuracy using synthetic outcomes, whose design was guided by domain experts. The benchmark extensively explores performance as real-world conditions like sample size, treatment correlation, and propensity score accuracy vary. Based on our benchmark, we observe that the class of doubly robust treatment effect estimators, which are based on simple and intuitive regression adjustment, generally outperforms other more complicated estimators by orders of magnitude. To better support our theoretical understanding of doubly robust estimators, we derive a closed-form expression for the variance of any such estimator that uses dataset splitting to obtain an unbiased estimate. This expression motivates the design of a new doubly robust estimator that uses a novel loss function when fitting functions for regression adjustment. We release the dataset and benchmark in a Python package; the package is built in a modular way to facilitate new datasets and estimators.

Paper Structure (25 sections, 2 theorems, 28 equations, 12 figures, 12 tables, 1 algorithm)


Key Result

Theorem 4.1

When the propensity scores are known exactly, the doubly robust estimator with split training $\hat{\tau}(\mathbf{z})$ is unbiased, i.e., $\mathbb{E}_{\mathbf{z}, S_1, S_2}[\hat{\tau}(\mathbf{z}) - \tau] = 0$, with variance given by

Figures (12)

  • Figure 1: Normalized CMAS scores for five equal-sized groups plotted against their propensity scores. The treatment has a larger effect on children who are more likely to receive it.
  • Figure 2: Synthetic outcomes designed in consultation with domain experts: treatments are targeted to under-served children, who benefit more [neuman1999books, neuman2001access].
  • Figure 3: Mean treatment rate and mean propensity score among observations with similar propensity scores. Because the points (predicted versus actual treatment rate) lie close to the identity line, we conclude the propensity scores are well calibrated.
  • Figure 4: Squared error of each estimator by the number of observations. The darker line is the median and the shaded region spans the first to third quartiles across 100 runs. The doubly robust estimator, followed by Double-Double, achieves the lowest squared error.
  • Figure 5: Squared error by distance correlation. The doubly robust estimator and Double-Double outperform the other estimators until the distance correlation surpasses 0.8.
  • ...and 7 more figures
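
The calibration check described in the Figure 3 caption can be sketched as follows. This is a minimal illustration and not the paper's code: the function name `calibration_table`, the use of equal-sized bins, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def calibration_table(p_hat, z, n_bins=10):
    """Bin observations by predicted propensity and compare with treatment rates.

    Assumption: bins hold (nearly) equal numbers of observations, sorted by
    predicted propensity. Returns one (mean prediction, treatment rate) per bin.
    """
    order = np.argsort(p_hat)              # indices sorted by predicted propensity
    bins = np.array_split(order, n_bins)   # near-equal-sized groups of indices
    mean_pred = np.array([p_hat[b].mean() for b in bins])
    mean_treated = np.array([z[b].mean() for b in bins])
    return mean_pred, mean_treated

# Hypothetical data: treatments drawn from the predicted propensities, so the
# predictions are calibrated by construction.
rng = np.random.default_rng(0)
p_hat = rng.uniform(0.05, 0.95, size=20_000)
z = rng.binomial(1, p_hat)

mean_pred, mean_treated = calibration_table(p_hat, z)
for pred, rate in zip(mean_pred, mean_treated):
    print(f"predicted {pred:.2f}   observed {rate:.2f}")
```

Plotting `mean_treated` against `mean_pred` gives points that should sit near the identity line when the propensity scores are well calibrated.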

Theorems & Definitions (3)

  • Theorem 4.1
  • Theorem D.1
  • proof
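
The unbiasedness claim in Theorem 4.1 can be illustrated numerically. The sketch below uses the standard augmented inverse-propensity-weighted (AIPW) form of a doubly robust estimator with a two-fold split; this form, the linear outcome models, and the names `fit_linear` and `dr_split_estimate` are assumptions for illustration and may differ from the estimator defined in the paper.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with an intercept; returns a prediction function."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Xn: np.column_stack([np.ones(len(Xn)), Xn]) @ coef

def dr_split_estimate(X, z, y, p, rng):
    """AIPW-style treatment effect estimate with two-fold split training.

    X: covariates, z: 0/1 treatments, y: observed outcomes,
    p: known propensity scores P(z_i = 1), rng: numpy Generator for the split.
    """
    n = len(y)
    perm = rng.permutation(n)
    folds = [perm[: n // 2], perm[n // 2:]]
    scores = np.empty(n)
    for k in (0, 1):
        train, test = folds[1 - k], folds[k]
        treated = train[z[train] == 1]
        control = train[z[train] == 0]
        # Outcome models are fit only on the other fold, so they are
        # independent of the treatments assigned in the evaluated fold.
        m1 = fit_linear(X[treated], y[treated])(X[test])
        m0 = fit_linear(X[control], y[control])(X[test])
        zt, yt, pt = z[test], y[test], p[test]
        # Regression adjustment plus inverse-propensity-weighted residuals.
        scores[test] = (m1 - m0
                        + zt * (yt - m1) / pt
                        - (1 - zt) * (yt - m0) / (1 - pt))
    return scores.mean()

# Hypothetical data with fixed potential outcomes and known propensities.
rng = np.random.default_rng(0)
n = 400
X = rng.normal(size=(n, 2))
p = np.clip(1 / (1 + np.exp(-X[:, 0])), 0.1, 0.9)
y0 = X[:, 0] + 0.5 * X[:, 1] ** 2      # control outcomes (nonlinear in X)
y1 = y0 + 1.0 + 0.5 * X[:, 0]          # treated outcomes
z = rng.binomial(1, p)
y = np.where(z == 1, y1, y0)

print("true effect:", (y1 - y0).mean())
print("one estimate:", dr_split_estimate(X, z, y, p, rng))
```

A single estimate is noisy. Averaging `dr_split_estimate` over many redraws of `z` and of the split should approach the true effect, even though the linear outcome models are misspecified here.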