Identification and Estimation for Nonignorable Missing Data: A Data Fusion Approach

Zixiao Wang, AmirEmad Ghassami, Ilya Shpitser

TL;DR

This paper introduces a data-fusion framework to identify and estimate a target mean under MNAR by augmenting a primary MNAR dataset with an auxiliary MAR dataset. It presents two complementary models—Model 1 where missingness depends on another variable and Model 2 leveraging a shadow-variable approach with an odds-ratio link—and corresponding IPW estimators that rely on information from both domains. The authors prove identification under the stated assumptions, validate the approach through simulations, and apply it to NYS COVID-19 hospitalization data to illustrate practical impact. The work offers a principled path to MNAR identification in two-domain settings and motivates future semiparametric efficiency developments.

Abstract

We consider the task of identifying and estimating a parameter of interest in settings where data is missing not at random (MNAR). In general, such parameters are not identified without strong assumptions on the missing data model. In this paper, we take an alternative approach and introduce a method inspired by data fusion, where information in an MNAR dataset is augmented by information in an auxiliary dataset subject to missingness at random (MAR). We show that even if the parameter of interest cannot be identified given either dataset alone, it can be identified given pooled data, under two complementary sets of assumptions. We derive an inverse probability weighted (IPW) estimator for identified parameters, and evaluate the performance of our estimation strategies via simulation studies, and a data application.
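As background for the inverse probability weighting mentioned in the abstract, the following is a minimal sketch of a generic normalized IPW estimator of a mean when the outcome is missing at random given a fully observed covariate. It is not the paper's two-domain estimator. The data-generating process, the logistic propensity model, and the names `fit_logistic`, `ipw_mean`, and `simulate` are assumptions made for illustration only.

```python
import numpy as np


def fit_logistic(X, r, n_iter=25):
    """Fit P(R=1 | X) = expit(X @ theta) by Newton-Raphson.

    X must already contain an intercept column.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ theta)))
        grad = X.T @ (r - p)                        # score of the log-likelihood
        hess = (X * (p * (1.0 - p))[:, None]).T @ X  # observed information
        theta = theta + np.linalg.solve(hess, grad)
    return theta


def ipw_mean(y, r, X):
    """Normalized (Hajek) IPW estimate of E[Y]; y may be NaN where r == 0."""
    theta = fit_logistic(X, r)
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    w = r / p                                # weight is zero for missing rows
    y_filled = np.where(r == 1, y, 0.0)      # keep NaNs out of the sum
    return np.sum(w * y_filled) / np.sum(w)


def simulate(n=20000, seed=0):
    """Toy MAR data (illustrative assumption): the true mean of Y is 1.0."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = 1.0 + 2.0 * x + rng.normal(size=n)
    p_obs = 1.0 / (1.0 + np.exp(-(0.5 + x)))  # missingness depends on x only
    r = rng.binomial(1, p_obs)
    y_obs = np.where(r == 1, y, np.nan)
    X = np.column_stack([np.ones(n), x])
    return y_obs, r, X


y_obs, r, X = simulate()
naive = np.nanmean(y_obs)      # complete-case mean, biased in this toy setup
ipw = ipw_mean(y_obs, r, X)    # reweighted mean, targets the true value 1.0
print(f"complete-case: {naive:.3f}, IPW: {ipw:.3f}")
```

In this toy setup the complete-case mean overstates the target because units with larger covariate values are more likely to be observed, while reweighting by the estimated observation probability corrects for it. Under MNAR this single-dataset correction is not available in general, which is the gap the paper's pooled-data approach addresses.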

Paper Structure

This paper contains 26 sections, 7 theorems, 55 equations, 3 figures, 7 tables.

Key Result

Theorem 1

Under Assumptions \ref{ass1:mr_g1}, \ref{ass2:mg_x}, and \ref{ass3:yr_xmg2}, the parameter $\beta = \mathbb{E}[Y^{(1)} \mid G=1]$ is identified using the following functional, where $g_1(X,M) \equiv \mathbb{E}[Y \mid X, M, G=1, R=1]$.

Figures (3)

  • Figure 1: Graphical models: (a) The auxiliary MAR data domain in both Model 1 and Model 2. (b) The primary MNAR domain in Model 1. (c) The pooled data for Model 1, including the selection-at-random mechanism. (d) The primary MNAR domain in Model 2. (e) The pooled data for Model 2, including the selection-at-random mechanism. Note that the text $G=1$ on the dotted arrow denotes a context-dependent relationship.
  • Figure 2: Simulation results for Model 1 and Model 2: bias in estimating $\mathbb{E}[Y^{(1)} \mid G=1]$. Boxplots for correctly specified and misspecified settings, computed from 1000 trials at sample sizes $n \in \{500, 1000, 2000\}$. The red point indicates the mean. (a) MAR estimates are clearly biased upwards. (b) IPW estimates, though slightly biased, concentrate around the true value as sample size increases. (c) IPW estimates are less biased than MAR estimates. (d) IPW estimates are less biased than MAR estimates. Boxplot statistics are in Table \ref{tab:box.m1} and Table \ref{tab:box.m2} in the Appendix.
  • Figure 3: Boxplots of bootstrap results (1000 replicates) for the IPW estimator in Model 1, the IPW estimator in Model 2, the MAR estimator, and the MCAR estimator. The blue dashed line shows the MCAR estimate: 0.7533. The statistical summary is reported in Table \ref{tab:add_covid} in the Appendix.
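
The bootstrap boxplots in Figure 3 summarize the resampling distribution of each estimator. The sketch below shows one generic way such a distribution and its boxplot statistics could be computed. It is an illustration under stated assumptions, not the authors' code: the toy binary outcomes, the use of the sample mean as the estimator, and the names `bootstrap_estimates` and `boxplot_stats` are all hypothetical.

```python
import numpy as np


def bootstrap_estimates(data, estimator, n_boot=1000, seed=0):
    """Nonparametric bootstrap: resample rows with replacement, re-estimate."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    out = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # indices drawn with replacement
        out[b] = estimator(data[idx])
    return out


def boxplot_stats(values):
    """Mean and quartiles, the quantities a boxplot displays."""
    q1, med, q3 = np.percentile(values, [25, 50, 75])
    return {
        "mean": float(np.mean(values)),
        "q1": float(q1),
        "median": float(med),
        "q3": float(q3),
    }


# Hypothetical toy data: fully observed binary outcomes, not the COVID-19 data.
rng = np.random.default_rng(1)
toy = rng.binomial(1, 0.7, size=2000).astype(float)

est = bootstrap_estimates(toy, np.mean)    # 1000 bootstrap replicates
stats = boxplot_stats(est)
print(stats)
```

Any estimator that maps a resampled dataset to a number can be passed in place of `np.mean`, which is how several estimators could be compared side by side on the same data.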

Theorems & Definitions (7)

  • Theorem 1: Identification in Model 1
  • Proposition 1: \cite{miao2015identification}
  • Theorem 2
  • Proposition 2
  • Proposition 3
  • Lemma 1
  • Lemma 2