Partial Identification under Missing Data Using Weak Shadow Variables from Pretrained Models

Hongyu Chen, David Simchi-Levi, Ruoxuan Xiong

TL;DR

This work addresses estimating population quantities under missing-not-at-random data by adopting a partial identification framework expressed through a pair of linear programs, yielding sharp bounds on the mean outcome. It introduces weak shadow variables from pretrained models, notably LLMs, as auxiliary predictions that satisfy a conditional independence constraint but need not meet classical completeness, enabling substantial tightening of identification regions. A set-expansion estimator is developed to ensure finite-sample validity and to provide convergence rates that adapt to whether identification is partial or point, with faster rates when the shadow information yields point identification. The approach is extended to randomized experiments and validated via numerical simulations and semi-synthetic analyses on customer-service dialogues, showing that LLM-derived weak shadows shrink bound widths by 75–83% while maintaining valid coverage under MNAR. The results highlight a practical path to leverage rich external models to improve inference under missing data without relying on brittle parametric assumptions or perfect predictive accuracy.

Abstract

Estimating population quantities such as mean outcomes from user feedback is fundamental to platform evaluation and social science, yet feedback is often missing not at random (MNAR): users with stronger opinions are more likely to respond, so standard estimators are biased and the estimand is not identified without additional assumptions. Existing approaches typically rely on strong parametric assumptions or bespoke auxiliary variables that may be unavailable in practice. In this paper, we develop a partial identification framework in which sharp bounds on the estimand are obtained by solving a pair of linear programs whose constraints encode the observed data structure. This formulation naturally incorporates outcome predictions from pretrained models, including large language models (LLMs), as additional linear constraints that tighten the feasible set. We call these predictions weak shadow variables: they satisfy a conditional independence assumption with respect to missingness but need not meet the completeness conditions required by classical shadow-variable methods. When predictions are sufficiently informative, the bounds collapse to a point, recovering standard identification as a special case. In finite samples, to provide valid coverage of the identified set, we propose a set-expansion estimator that achieves a slower-than-$\sqrt{n}$ convergence rate in the set-identified regime and the standard $\sqrt{n}$ rate under point identification. In simulations and semi-synthetic experiments on customer-service dialogues, we find that LLM predictions are often ill-conditioned for classical shadow-variable methods yet remain highly effective in our framework. They shrink identification intervals by 75–83% while maintaining valid coverage under realistic MNAR mechanisms.
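
To make the linear-programming idea concrete, the following is a minimal sketch rather than the paper's implementation. It assumes a discrete outcome and a discrete prediction, no covariates, and known population probabilities (so the finite-sample set-expansion step is not shown). The helper name `mean_bounds`, the toy probabilities, and the reduction to the unknowns $w_k = P(Y = y_k, R = 0)$ are illustrative assumptions. The conditional independence of $F$ and $R$ given $Y$ is what turns the prediction into linear equality constraints.

```python
import numpy as np
from scipy.optimize import linprog


def mean_bounds(p_obs, p_f_miss, y_values, use_shadow=True):
    """Bounds on E[Y] for discrete Y and F (hypothetical helper).

    p_obs[k, j] = P(Y = y_values[k], F = j, R = 1)   (observed)
    p_f_miss[j] = P(F = j, R = 0)                     (observed)
    Unknown LP variables: w[k] = P(Y = y_values[k], R = 0) >= 0.
    """
    p_obs = np.asarray(p_obs, dtype=float)
    p_f_miss = np.asarray(p_f_miss, dtype=float)
    y_values = np.asarray(y_values, dtype=float)
    observed_part = float(y_values @ p_obs.sum(axis=1))  # E[R * Y]

    if use_shadow:
        # F independent of R given Y implies
        # P(F=j | Y=k, R=0) = P(F=j | Y=k, R=1) =: g[k, j],
        # so sum_k g[k, j] * w[k] = P(F=j, R=0) is linear in w.
        g = p_obs / p_obs.sum(axis=1, keepdims=True)  # needs every Y level observed
        A_eq, b_eq = g.T, p_f_miss
    else:
        # Only the total missing mass is known: sum_k w[k] = P(R=0).
        A_eq, b_eq = np.ones((1, y_values.size)), np.array([p_f_miss.sum()])

    # One LP minimizes and one maximizes the unobserved part sum_k y_k * w[k].
    lower = linprog(y_values, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    upper = linprog(-y_values, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    if not (lower.success and upper.success):
        raise ValueError("LP infeasible: inputs violate the assumed constraints")
    return observed_part + lower.fun, observed_part - upper.fun


# Toy population (assumed numbers): three outcome levels, two prediction levels.
y_values = np.array([1.0, 2.0, 3.0])
p_y = np.array([0.3, 0.4, 0.3])               # P(Y = y), true mean is 2.0
p_respond = np.array([0.3, 0.5, 0.9])         # P(R = 1 | Y = y): MNAR
p_f_given_y = np.array([[0.8, 0.2],           # P(F = j | Y = y), same for R = 0, 1
                        [0.5, 0.5],
                        [0.1, 0.9]])

p_obs = (p_y * p_respond)[:, None] * p_f_given_y
p_f_miss = (p_y * (1 - p_respond)) @ p_f_given_y

print("no shadow:  ", mean_bounds(p_obs, p_f_miss, y_values, use_shadow=False))
print("weak shadow:", mean_bounds(p_obs, p_f_miss, y_values, use_shadow=True))
```

In this toy setup the prediction takes only two values while the outcome takes three, so the constraints cannot pin down the missing-outcome distribution and the mean stays set-identified. The interval still shrinks from about [1.74, 2.62] to about [1.97, 2.01] around the true mean of 2.0. If the prediction had as many informative levels as the outcome, the two LP values would coincide and the bounds would collapse to a point.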

Paper Structure (30 sections, 18 theorems, 99 equations, 4 figures, 3 tables)

Key Result

Proposition 1

The sharp identification region for $\theta$ given observed MNAR data $\{R_i, R_iY_i\}_{i=1}^n$ is $\Theta = [\theta_{\min}, \theta_{\max}]$, defined by the optimization problem labeled opt:mean-mnar-range in the paper, which has closed-form solutions.
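
For orientation only, bounds of this kind take the following familiar worst-case form when $Y$ is assumed to lie in a known interval $[y_{\min}, y_{\max}]$. The support bounds are an assumption introduced here, and the paper's exact statement is not reproduced in this excerpt. The respondents' contribution $\mathbb{E}[RY]$ is identified from the data, while the missing mass $\mathbb{P}(R=0)$ may sit anywhere in the support:

$$
\theta_{\min} = \mathbb{E}[RY] + \mathbb{P}(R=0)\, y_{\min},
\qquad
\theta_{\max} = \mathbb{E}[RY] + \mathbb{P}(R=0)\, y_{\max}.
$$

The interval width is $\mathbb{P}(R=0)\,(y_{\max} - y_{\min})$, so without further constraints the bounds tighten only as the response rate rises.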

Figures (4)

  • Figure 1: Causal diagram depicting the relationships among observed covariates $X$, true outcome $Y$, observation indicator $R$, and predicted outcome $F$. We assume conditional independence $F \perp \!\!\! \perp R \mid Y, X$. The dashed arrow from $F$ to $Y$ indicates an optional dependence; we do not require the full shadow variable assumption.
  • Figure 2: Causal diagram for the experimental setting with shadow variable. Covariates $X$ affect both the outcome $Y$ and prediction $F$. Treatment $D$ affects $Y$, $R$, and $F$. The outcome $Y$ affects the observation indicator $R$. Crucially, there is no direct edge from $F$ to $R$, reflecting the conditional independence assumption.
  • Figure 3: Estimator comparison for MNAR simulation data. The Set Expansion estimator (blue) provides valid bounds containing the true mean $\mu=3.0$ in both (a) point-identified and (b) set-identified settings, while point estimators (CCA, NI, PPI, Heckman, PM) remain biased. Shaded regions: oracle bounds without shadow variable (green) and with shadow variable (orange). Averaged over 100 replications.
  • Figure 4: Estimator comparison under three MNAR patterns. Blue: Set Expansion bounds. Green: Aggregated LP bounds. Red line: true mean $\mu = 3.73$. Points with error bars: point estimators with 95% CIs.
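
The structure in Figure 1 can be illustrated with a small simulation. This is a sketch under assumed distributions, not the paper's simulation design: the outcome scale, response probabilities, and noise level below are all hypothetical. It shows that when response depends on $Y$, the complete-case average (CCA) is biased, while a prediction $F$ that depends only on $Y$ has the same conditional distribution among respondents and non-respondents.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

y = rng.integers(1, 6, size=n)            # Y uniform on {1, ..., 5} (assumed)
p_respond = 0.2 + 0.15 * (y - 1)          # P(R=1 | Y) rises with Y: MNAR
r = rng.random(n) < p_respond             # observation indicator R
f = y + rng.normal(0.0, 1.0, size=n)      # F depends on Y only, so F is independent of R given Y

true_mean = y.mean()                      # uses all outcomes (unavailable in practice)
cca_mean = y[r].mean()                    # complete-case average over respondents

# Largest gap in E[F | Y=k] between respondents and non-respondents.
ci_gap = max(
    abs(f[(y == k) & r].mean() - f[(y == k) & ~r].mean())
    for k in range(1, 6)
)

print(f"true mean {true_mean:.3f}, complete-case mean {cca_mean:.3f}")
print(f"max conditional gap in F across R: {ci_gap:.3f}")
```

Under these assumed response probabilities the complete-case mean converges to 3.6 while the true mean is 3.0, and the conditional gap in $F$ stays near zero up to sampling noise.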

Theorems & Definitions (20)

  • Proposition 1
  • Proposition 2
  • Definition 1: Shadow Variable
  • Definition 2: Completeness of $\mathbb{P}(Y \mid X, F, R=1)$
  • Proposition 3
  • Proposition 4
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Proposition 5
  • ...and 10 more