What Is a Good Imputation Under MAR Missingness?

Jeffrey Näf, Erwan Scornet, Julie Josse

TL;DR

This work analyzes missing data imputation under MAR from a distributional perspective, showing that per-variable imputation within FCS can identify correct conditionals even in nonparametric settings, provided overlap holds. It introduces three guiding properties for ideal imputers (distributional regression capability, nonlinearity capture, and robustness to covariate shifts) and presents mice-DRF, a distributional random forest-based imputation method meeting two of these criteria. The authors advocate energy distance as a principled evaluation metric for distributional fidelity, and demonstrate through simulations and the Air Quality dataset that RMSE often misorders methods, while distributional scores better reflect downstream performance. The findings stress that strong distributional shifts within MAR complicate imputation and that progress hinges on methods capable of drawing from the full conditional distributions under shifting covariate supports. Overall, the paper provides a principled framework for evaluating and developing MAR imputation methods beyond traditional RMSE-focused benchmarks.

Key contributions include: (i) formal MAR definitions and identifiability results for sequential, per-variable imputation; (ii) a concrete set of criteria for ideal imputation methods and the introduction of mice-DRF; (iii) an energy-distance-based evaluation framework for imputation quality; and (iv) empirical evidence that distributional imputers outperform marginal predictors in MAR settings with covariate shifts, while highlighting remaining challenges when shifts are strong.
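
To make the contrast between RMSE and a distributional score concrete, here is a minimal sketch, not the paper's experimental setup. It assumes a toy model $X_2 = X_1 + \varepsilon$ with standard normal noise and compares two oracle imputers for $X_2$: the conditional mean, and a draw from the true conditional distribution. The helper names (`mean_pairwise_dist`, `energy_distance`, `rmse`) and the plug-in form of the sample energy distance are choices made for this illustration.

```python
import numpy as np

def mean_pairwise_dist(a, b):
    """Average Euclidean distance between all rows of a and all rows of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).mean()

def energy_distance(x, y):
    """Plug-in sample energy distance: 2 E|X-Y| - E|X-X'| - E|Y-Y'|."""
    return (2 * mean_pairwise_dist(x, y)
            - mean_pairwise_dist(x, x)
            - mean_pairwise_dist(y, y))

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Toy data (assumption): X2 = X1 + noise, and X2 is the variable to impute.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2_true = x1 + rng.normal(size=n)

# Two oracle imputers that both know the true conditional X2 | X1.
x2_mean = x1.copy()                  # conditional mean E[X2 | X1]
x2_draw = x1 + rng.normal(size=n)    # a draw from the conditional distribution

truth = np.column_stack([x1, x2_true])
imp_mean = np.column_stack([x1, x2_mean])
imp_draw = np.column_stack([x1, x2_draw])

rmse_mean, rmse_draw = rmse(x2_mean, x2_true), rmse(x2_draw, x2_true)
ed_mean, ed_draw = energy_distance(imp_mean, truth), energy_distance(imp_draw, truth)

print(f"RMSE            mean: {rmse_mean:.3f}   draw: {rmse_draw:.3f}")
print(f"energy distance mean: {ed_mean:.3f}   draw: {ed_draw:.3f}")
```

In this toy setting RMSE prefers the conditional-mean imputer even though it collapses the imputed values onto a line, while the energy distance prefers the imputer that reproduces the joint distribution.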

Abstract

Missing values pose a persistent challenge in modern data science. Consequently, there is an ever-growing number of publications introducing new imputation methods in various fields. The present paper attempts to take a step back and provide a more systematic analysis. Starting from an in-depth discussion of the Missing at Random (MAR) condition for nonparametric imputation, we first investigate whether the widely used fully conditional specification (FCS) approach indeed identifies the correct conditional distributions. Based on this analysis, we propose three essential properties an ideal imputation method should meet, thus enabling a more principled evaluation of existing methods and more targeted development of new methods. In particular, we introduce a new imputation method, denoted mice-DRF, that meets two out of the three criteria. We also discuss ways to compare imputation methods, based on distributional distances. Finally, numerical experiments illustrate the points made in this discussion.
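
For readers unfamiliar with FCS, the following sketch shows the general shape of a per-variable imputation loop. It is a simplified illustration, not mice-DRF and not the `mice` implementation: each conditional is fitted with a Gaussian linear regression on the rows where the variable is observed, and imputations are drawn by adding residual noise. The function name `fcs_impute`, the linear model, and the initialisation scheme are all assumptions made for this example; a full MICE procedure would also draw the regression parameters.

```python
import numpy as np

def fcs_impute(X, n_iter=10, seed=0):
    """Simplified FCS loop: impute each column from all others, in turn.

    Missing entries are encoded as NaN. Each conditional is a Gaussian
    linear regression (an assumption of this sketch), and imputations are
    random draws around the fitted mean rather than the mean itself.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    out = X.copy()

    # Initialise by sampling from the observed values of each column.
    for j in range(X.shape[1]):
        if miss[:, j].any():
            observed = X[~miss[:, j], j]
            out[miss[:, j], j] = rng.choice(observed, size=miss[:, j].sum())

    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not miss[:, j].any():
                continue
            # Design matrix: intercept plus the current values of all other columns.
            others = np.delete(out, j, axis=1)
            A = np.column_stack([np.ones(len(out)), others])
            obs_rows = ~miss[:, j]
            # Fit only on rows where column j is actually observed.
            beta, *_ = np.linalg.lstsq(A[obs_rows], out[obs_rows, j], rcond=None)
            resid = out[obs_rows, j] - A[obs_rows] @ beta
            sigma = resid.std(ddof=A.shape[1])
            # Draw from the fitted conditional instead of plugging in its mean.
            noise = rng.normal(0.0, sigma, size=miss[:, j].sum())
            out[miss[:, j], j] = A[miss[:, j]] @ beta + noise
    return out
```

A call such as `fcs_impute(X)` on an array with NaN entries returns a completed copy and leaves observed entries unchanged. Swapping the linear regression for a distributional regression method is where approaches like the one discussed in the paper would differ.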

Paper Structure

This paper contains 25 sections, 16 theorems, 108 equations, 14 figures, 3 tables, 1 algorithm.

Key Result

Lemma 2.1

Condition SM-MAR is equivalent to SM-MAR II.

Figures (14)

  • Figure 1: Illustration of Example 1. Left: Distribution we would like to impute, $X_1 \mid M=m_3$. Middle: Distribution of $X_1$ in the fully observed pattern $(X_1 \mid M=m_1)$. Right: Distribution of $X_1$ in the second pattern $(X_1 \mid M=m_2)$.
  • Figure 2: $\mathbf{X}$ is the assumed underlying full data, $\mathbf{M}$ is the vector of missing indicators, and $\mathbf{X}^*$ arises when $\mathbf{M}$ is applied to $\mathbf{X}$. Thus each row of $\mathbf{X}$ (or $\mathbf{X}^*$) is an observation under a different pattern. Under condition CIMAR, the distribution of $X_1, X_2 \mid X_3$ is not allowed to change when moving from one pattern to another, though the marginal distribution of $X_3$ is allowed to change. In contrast, under MCAR, no change is allowed. Under MAR (PMM-MAR), the only constraint is that the distribution of $X_1, X_2 \mid X_3$ in the third pattern is the same as the unconditional one.
  • Figure 3: Relationships between the MAR conditions discussed in this paper. An arrow from condition $A$ to condition $B$ encodes that $A$ implies $B$. The definitions are given in the paper's section on MAR definitions.
  • Figure 4: Illustration of the distribution of $(X_1, X_2) \mid M \in L_{m_4}$. Left: Univariate distribution of $X_1 \mid M \in L_{m_4}$. Middle: Univariate distribution of $X_2 \mid M \in L_{m_4}$. Right: Joint distribution with density overlay.
  • Figure 5: Left: An example where the overlap condition is not met. Right: An example where the overlap condition is met. In both cases, fully observed points are drawn in dark yellow, while for the pattern in which $X_1$ is missing, only $X_2$ is shown, in blue. In both cases $X_1 \mid X_2$ remains the same.
  • ...and 9 more figures

Theorems & Definitions (40)

  • Example 1
  • Definition 2.1: SM-MAR
  • Definition 2.2: SM-MAR II
  • Lemma 2.1
  • Definition 2.3: PMM-MAR
  • Proposition 2.1
  • Definition 2.4
  • Definition 2.5
  • Proposition 2.2
  • Lemma 2.2
  • ...and 30 more