Zero Inflation as a Missing Data Problem: a Proxy-based Approach

Trung Phung, Jaron J. R. Lee, Opeyemi Oladapo-Shittu, Eili Y. Klein, Ayse Pinar Gurses, Susan M. Hannum, Kimberly Weems, Jill A. Marsteller, Sara E. Cosgrove, Sara C. Keller, Ilya Shpitser

TL;DR

This work reframes zero-inflated data as a missing-data problem where the censoring indicator is itself unobserved when zeros occur. By introducing proxy variables for the missingness mechanism and adapting the Kuroki–Pearl approach, the authors derive conditions for point identification and sharp bounds across MCAR, MAR, and MNAR settings within m-DAGs. They establish analytic bounds for key proxy-conditional distributions, provide numerical bound methods, and demonstrate both simulation validation and a CLABSI data application to illustrate practical sensitivity analyses. The results offer a principled path to identifiability or informative partial identification in zero-inflated contexts, with implications for diverse domains using recording-constrained data.

Abstract

A common type of zero-inflated data has certain true values incorrectly replaced by zeros due to data recording conventions (rare outcomes assumed to be absent) or details of data recording equipment (e.g. artificial zeros in gene expression data). Existing methods for zero-inflated data either fit the observed data likelihood via parametric mixture models that explicitly represent excess zeros, or aim to replace excess zeros by imputed values. If the goal of the analysis relies on knowing true data realizations, a particular challenge with zero-inflated data is identifiability, since it is difficult to correctly determine which observed zeros are real and which are inflated. This paper views zero-inflated data as a general type of missing data problem, where the observability indicator for a potentially censored variable is itself unobserved whenever a zero is recorded. We show that, without additional assumptions, target parameters involving a zero-inflated variable are not identified. However, if a proxy of the missingness indicator is observed, a modification of the effect restoration approach of Kuroki and Pearl allows identification and estimation, provided that the proxy-indicator relationship is known. If this relationship is unknown, our approach yields a partial identification strategy for sensitivity analysis. Specifically, we show that only certain proxy-indicator relationships are compatible with the observed data distribution. We give an analytic bound for this relationship in cases with a categorical outcome, which is sharp in certain models. For more complex cases, sharp numerical bounds may be computed using the methods in Duarte et al. [2023]. We illustrate our method via simulation studies and a data application on central line-associated bloodstream infections (CLABSIs).
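To make the restoration idea concrete, here is a minimal sketch under simplifying assumptions that are not taken from the paper: a binary indicator $R$, a binary proxy $W$ that is independent of $X^{(1)}$ given $R$, an MCAR mechanism ($R$ independent of $X^{(1)}$), and a known table $p(W \mid R)$. Under these assumptions $p(X, W) = \sum_r p(W \mid R = r)\, p(X, R = r)$, so inverting the $2 \times 2$ proxy table recovers $p(X, R)$, and $p(X^{(1)}) = p(X \mid R = 1)$. The function names and the numbers are hypothetical and serve only as an illustration in the spirit of matrix-inversion effect restoration, not as the authors' estimator.

```python
def observed_joint(p_x1, p_r1, p_w_given_r):
    """Forward model (illustrative): X = X^(1) if R = 1, else X = 0.

    p_x1: dict x -> p(X^(1) = x); p_r1: p(R = 1);
    p_w_given_r[w][r] = p(W = w | R = r). Returns x -> (p(x, W=0), p(x, W=1)).
    """
    joint = {}
    for x in set(p_x1) | {0}:
        p_x_r1 = p_x1.get(x, 0.0) * p_r1            # p(X = x, R = 1)
        p_x_r0 = (1.0 - p_r1) if x == 0 else 0.0    # inflated zeros only
        joint[x] = tuple(
            p_w_given_r[w][0] * p_x_r0 + p_w_given_r[w][1] * p_x_r1
            for w in (0, 1)
        )
    return joint


def restore_joint(p_xw, p_w_given_r):
    """Recover x -> (p(x, R=0), p(x, R=1)) by inverting the known proxy table.

    Assumes W is independent of X given R (an assumption of this sketch).
    """
    (a, b), (c, d) = p_w_given_r
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("p(W | R) is not invertible: the proxy is uninformative")
    inv = [[d / det, -b / det], [-c / det, a / det]]
    return {
        x: (inv[0][0] * w0 + inv[0][1] * w1, inv[1][0] * w0 + inv[1][1] * w1)
        for x, (w0, w1) in p_xw.items()
    }


def target_law_mcar(p_xr):
    """Under MCAR, p(X^(1) = x) = p(X = x | R = 1)."""
    p_r1 = sum(r1 for _, r1 in p_xr.values())
    return {x: r1 / p_r1 for x, (_, r1) in p_xr.items()}


# Hypothetical ground truth, used only to generate an observed distribution.
true_p_x1 = {0: 0.5, 1: 0.3, 2: 0.2}
p_w_given_r = [[0.8, 0.1],   # p(W=0 | R=0), p(W=0 | R=1)
               [0.2, 0.9]]   # p(W=1 | R=0), p(W=1 | R=1)

p_xw = observed_joint(true_p_x1, 0.6, p_w_given_r)
recovered = target_law_mcar(restore_joint(p_xw, p_w_given_r))
print(recovered)
```

When the proxy table is not known, the same inversion can be repeated over a grid of candidate tables; candidates that produce negative entries in the restored $p(X, R)$ cannot be compatible with the observed data, which is the intuition behind using compatibility for sensitivity analysis.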

Paper Structure (31 sections, 12 theorems, 73 equations, 8 figures, 3 tables)


Key Result

Lemma 1

Given a ZI model associated with any m-DAG ${\cal G}$, both the target law $p(X^{(1)})$ and the full law $p(X^{(1)}, R, C)$ are non-parametrically non-identified.
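A small numerical illustration of why identification fails, assuming the simplest case of a binary $X^{(1)}$ with $R$ independent of $X^{(1)}$ (a ZI MCAR setting without covariates or proxies): two different pairs of target law and missingness probability produce exactly the same observed distribution $p(X)$. The function name and the numbers are hypothetical, and the example illustrates the lemma in one special case rather than proving it.

```python
def observed_law(p_x1, p_r1):
    """Observed p(X) when X = X^(1) if R = 1 and X = 0 otherwise,
    with R independent of X^(1) (illustrative MCAR setting)."""
    obs = {x: p_r1 * p for x, p in p_x1.items()}
    obs[0] = obs.get(0, 0.0) + (1.0 - p_r1)  # add the inflated zeros
    return obs


# Two distinct candidate models (hypothetical numbers).
law_a, r_a = {0: 0.5, 1: 0.5}, 0.4
law_b, r_b = {0: 0.75, 1: 0.25}, 0.8

model_a = observed_law(law_a, r_a)
model_b = observed_law(law_b, r_b)

# Both candidates imply the same observed distribution over X,
# so the observed data alone cannot tell them apart.
same_observed = all(abs(model_a[x] - model_b[x]) < 1e-12 for x in (0, 1))
print(model_a, model_b, same_observed)
```

Because the observed zeros mix true zeros with inflated ones, any increase in $p(R = 0)$ can be offset by lowering $p(X^{(1)} = 0)$, which is why extra information such as a proxy for $R$ is needed.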

Figures (8)

  • Figure 1: Missing data scenarios represented by m-DAGs. Circle nodes denote observed variables, while other nodes are unobserved. Gray edges denote the deterministic nature of $p(X_i \mid R_i, X^{(1)}_i)$ due to consistency. (a) $X^{(1)}_1$ is MCAR since $R_1 \perp\!\!\!\perp X^{(1)}$. (b) $X^{(1)}_1$ is MAR since $R_1 \perp\!\!\!\perp X^{(1)} \mid C$. (c) $X^{(1)}_1,X^{(1)}_2,X^{(1)}_3$ are MNAR, since the observability indicators $R_1,R_2,R_3$ are not independent of these variables, either marginally or given observed variables.
  • Figure 2: Examples of a proxy-augmented ZI MCAR model (a) and ZI MAR models (b and c). A1 and A2 hold in (a), A1$^\dag$ and A2$^{\dag}$ hold in (b), and A1$^*$ and A2$^*$ hold in (c). Unlike in standard missing data, the indicator $R$ is only partially observed.
  • Figure 3: Examples of proxy-augmented ZI MNAR models. (a) ZI bivariate block-parallel model. (b) ZI bivariate block-sequential MAR model.
  • Figure 4: CLABSI rate consistent with model-compatible distributions $p(W \mid R)$ under the ZI MAR model with assumptions A1$^*$ and A2$^*$.
  • Figure 5: The graph considered in Theorem 3: proxy-augmented ZI MCAR model satisfying A1 and A2 (Fig. 2a in the main paper).
  • ...and 3 more figures

Theorems & Definitions (22)

  • Lemma 1: Non-identifiability
  • Theorem 1: ZI law restoration in ZI MCAR
  • Remark
  • Theorem 2: ZI law restoration
  • Proposition 1: ZI full law identification
  • Theorem 3: ZI MCAR compatibility bound
  • Theorem 4: ZI MAR compatibility bound 1
  • Theorem 5: ZI MAR compatibility bound 2
  • Lemma 2
  • Lemma 3
  • ...and 12 more