Efficient Difference-in-Differences Estimation when Outcomes are Missing at Random

Lorenzo Testa, Edward H. Kennedy, Matthew Reimherr

TL;DR

The paper addresses missing outcomes in Difference-in-Differences by establishing identification and semiparametric efficiency bounds under two MAR missingness mechanisms. It then constructs cross-fitted, efficient, multiply robust estimators that leverage efficient influence functions and a nested regression (DR-Learner) augmentation to attain oracle-like performance when nuisance models are well specified. The proposed estimators are shown to be asymptotically normal and efficient, with reliability backed by extensive simulations. A real-data demonstration and discussion of extensions underscore the practical relevance for causal inference with incomplete panel data.

Abstract

The Difference-in-Differences (DiD) method is a fundamental tool for causal inference, yet its application is often complicated by missing data. Although recent work has developed robust DiD estimators for complex settings like staggered treatment adoption, these methods typically assume complete data and fail to address the critical challenge of outcomes that are missing at random (MAR) -- a common problem that invalidates standard estimators. We develop a rigorous framework, rooted in semiparametric theory, for identifying and efficiently estimating the Average Treatment Effect on the Treated (ATT) when either pre- or post-treatment (or both) outcomes are missing at random. We first establish nonparametric identification of the ATT under two minimal sets of sufficient conditions. For each, we derive the semiparametric efficiency bound, which provides a formal benchmark for asymptotic optimality. We then propose novel estimators that are asymptotically efficient, achieving this theoretical bound. A key feature of our estimators is their multiple robustness, which ensures consistency even if some nuisance function models are misspecified. We validate the properties of our estimators and showcase their broad applicability through an extensive simulation study.
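
As a rough illustration of two ideas named in the abstract, cross-fitting and robustness to nuisance misspecification, the sketch below estimates a simple outcome mean when outcomes are missing at random given covariates. It is not the paper's ATT estimator: it is the standard augmented inverse-probability-weighted (AIPW) mean, which is only doubly robust. The function names, the linear and logistic nuisance models, the clipping threshold, and the simulated data are all assumptions made for this example.

```python
import numpy as np

def fit_logistic(X, r, n_iter=25, ridge=1e-6):
    """Logistic regression by Newton-Raphson; X must already contain an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        grad = X.T @ (r - p) - ridge * beta
        hess = (X * w[:, None]).T @ X + ridge * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def crossfit_aipw_mean(X, y, r, n_folds=5, seed=0):
    """Cross-fitted AIPW estimate of E[Y] when Y is observed only where r == 1."""
    n = X.shape[0]
    Xd = np.column_stack([np.ones(n), X])          # add intercept
    rng = np.random.default_rng(seed)
    folds = rng.permutation(n) % n_folds           # random, balanced fold labels
    phi = np.empty(n)                              # per-unit influence-function values
    for k in range(n_folds):
        train, test = folds != k, folds == k
        obs = train & (r == 1)
        # Nuisance 1: outcome regression, fit on observed training rows only.
        gamma, *_ = np.linalg.lstsq(Xd[obs], y[obs], rcond=None)
        # Nuisance 2: probability that the outcome is observed, fit on training rows.
        beta = fit_logistic(Xd[train], r[train])
        mu = Xd[test] @ gamma
        pi = np.clip(1.0 / (1.0 + np.exp(-Xd[test] @ beta)), 0.01, 1.0)
        # Missing outcomes are replaced by 0; they receive zero weight below.
        y_test = np.where(r[test] == 1, y[test], 0.0)
        phi[test] = mu + r[test] / pi * (y_test - mu)
    est = phi.mean()
    se = phi.std(ddof=1) / np.sqrt(n)
    return est, se

# Simulated data (assumed for illustration): the true mean of Y is 1.0,
# and units with larger X[:, 0] are more likely to have Y observed.
rng = np.random.default_rng(1)
n = 4000
X = rng.normal(size=(n, 2))
y_full = 1.0 + X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)
p_obs = 1.0 / (1.0 + np.exp(-(0.5 + X[:, 0])))
r = rng.binomial(1, p_obs)
y = np.where(r == 1, y_full, np.nan)

est, se = crossfit_aipw_mean(X, y, r)
print(f"complete-case mean: {np.nanmean(y):.3f}")
print(f"cross-fitted AIPW:  {est:.3f} (se {se:.3f})")
```

In this setup the complete-case mean is pulled away from the true value because observation depends on a covariate that also drives the outcome, while the AIPW estimate corrects for it. Each fold's nuisance models are fit on the other folds, so no unit's own outcome is used to build the predictions that enter its influence-function value.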

Paper Structure

This paper contains 29 sections, 8 theorems, 40 equations, 8 figures, 3 tables, 1 algorithm.

Key Result

Lemma 2.7

Under Assumption ass:identify and Assumption ass:mar_simple, the ATT can be identified as a function of the observed data. Likewise, under Assumption ass:identify and Assumption ass:mar_hard, the ATT can be identified as a function of the observed data.

Figures (8)

  • Figure 1: Example DAGs showing some of the causal dependencies allowed under Assumptions ass:mar_simple and ass:mar_hard.
  • Figure 2: Simulation results. Boxplots of bias (top row) and root mean squared error (RMSE) (bottom row) from 500 simulation runs. The left column shows the performance of the estimator under Assumption ass:mar_simple, while the right column shows the performance under Assumption ass:mar_hard. Each scenario on the x-axis represents a combination of correctly specified (indicated by a star $\star$) and misspecified nuisance functions (indicated by a bar). The results visually confirm the multiple robustness property of our estimators. Both bias and RMSE are negligible when a sufficient subset of nuisance models is correctly specified. Performance degrades significantly, as theory predicts, when the conditions for multiple robustness are violated.
  • Figure 3: Simulation results with $n=2000$. Barplot of empirical coverage from 500 simulation runs. The left column shows the performance of the estimator under Assumption ass:mar_simple, while the right column shows the performance under Assumption ass:mar_hard. Each scenario on the x-axis represents a combination of correctly specified (indicated by a star $\star$) and misspecified nuisance functions (indicated by a bar). The results visually confirm the multiple robustness property of our estimators.
  • Figure 4: Simulation results with $n=500$. Boxplots of bias (top row), root mean squared error (RMSE) (middle row), and barplot of empirical coverage (bottom row) from 500 simulation runs. The left column shows the performance of the estimator under Assumption ass:mar_simple, while the right column shows the performance under Assumption ass:mar_hard. Each scenario on the x-axis represents a combination of correctly specified (indicated by a star $\star$) and misspecified nuisance functions (indicated by a bar). The results visually confirm the multiple robustness property of our estimators. Both bias and RMSE are negligible when a sufficient subset of nuisance models is correctly specified. Performance degrades significantly, as theory predicts, when the conditions for multiple robustness are violated.
  • Figure 5: Simulation results with $n=1000$. Boxplots of bias (top row), root mean squared error (RMSE) (middle row), and barplot of empirical coverage (bottom row) from 500 simulation runs. The left column shows the performance of the estimator under Assumption ass:mar_simple, while the right column shows the performance under Assumption ass:mar_hard. Each scenario on the x-axis represents a combination of correctly specified (indicated by a star $\star$) and misspecified nuisance functions (indicated by a bar). The results visually confirm the multiple robustness property of our estimators. Both bias and RMSE are negligible when a sufficient subset of nuisance models is correctly specified. Performance degrades significantly, as theory predicts, when the conditions for multiple robustness are violated.
  • ...and 3 more figures

Theorems & Definitions (25)

  • Remark 1.1
  • Remark 2.2
  • Example 2.5: Medical records motivation
  • Remark 2.6
  • Lemma 2.7: Identification of ATT
  • Proposition 2.8: Semiparametric efficiency bounds
  • Remark 2.9
  • Remark 3.1: Multiple robustness
  • Definition 3.2: Stability of estimator
  • Remark 3.3
  • ...and 15 more