Simple Imputation Rules for Prediction with Missing Data: Contrasting Theoretical Guarantees with Empirical Performance

Dimitris Bertsimas, Arthur Delarue, Jean Pauphilet

TL;DR

This paper investigates prediction with missing data by analyzing impute-then-regress pipelines. It proves asymptotic consistency for a broad class of simple imputation rules and shows that mean imputation is asymptotically optimal for continuous features while mode imputation is suboptimal for categorical features. Empirically, the theory largely holds on synthetic and semi-real data, but real data reveal gaps influenced by missingness mechanisms and downstream model choice. The work emphasizes that encoding missingness as information can be beneficial, questions MAR-based guarantees for prediction, and calls for more realistic data-generation models to evaluate imputation strategies.

Abstract

Missing data is a common issue in real-world datasets. This paper studies the performance of impute-then-regress pipelines by contrasting theoretical and empirical evidence. We establish the asymptotic consistency of such pipelines for a broad family of imputation methods. While common sense suggests that a 'good' imputation method produces datasets that are plausible, we show, on the contrary, that, as far as prediction is concerned, crude can be good. Among other results, we find that mode-impute is asymptotically sub-optimal, while mean-impute is asymptotically optimal. We then exhaustively assess the validity of these theoretical conclusions on a large corpus of synthetic, semi-real, and real datasets. While the empirical evidence we collect mostly supports our theoretical findings, it also highlights gaps between theory and practice and opportunities for future research, regarding the relevance of the MAR assumption, the complex interdependency between the imputation and regression tasks, and the need for realistic synthetic data generation models.
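To make the term "impute-then-regress" concrete, here is a minimal sketch of a mean-impute-then-regress pipeline on a toy one-feature problem. It is an illustration only, not the authors' experimental code: the data-generating process, the 30% missing rate, and the helper names (`mean_impute`, `fit_simple_ols`, `r_squared`, `make_data`) are all assumptions introduced here, and a simple least-squares line stands in for the downstream learner.

```python
import random
import statistics


def mean_impute(column, fill_value):
    """Replace missing entries (None) with a fixed fill value."""
    return [fill_value if v is None else v for v in column]


def fit_simple_ols(x, y):
    """Closed-form least squares for y ~ intercept + slope * x."""
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    y_bar = statistics.fmean(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - y_bar) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot


def make_data(n, missing_rate, rng):
    """Hypothetical toy data: y = 2x + noise, x masked completely at random."""
    x, y = [], []
    for _ in range(n):
        xi = rng.gauss(1.0, 1.0)
        y.append(2.0 * xi + rng.gauss(0.0, 0.5))
        x.append(None if rng.random() < missing_rate else xi)
    return x, y


rng = random.Random(0)
x_train, y_train = make_data(2000, 0.3, rng)
x_test, y_test = make_data(1000, 0.3, rng)

# Step 1 (impute): the fill value is learned on the training set only,
# then reused unchanged on the test set.
train_mean = statistics.fmean([v for v in x_train if v is not None])
x_train_imp = mean_impute(x_train, train_mean)
x_test_imp = mean_impute(x_test, train_mean)

# Step 2 (regress): fit the downstream model on the imputed training data.
intercept, slope = fit_simple_ols(x_train_imp, y_train)

# Evaluate out of sample on the imputed test data.
y_pred = [intercept + slope * xi for xi in x_test_imp]
test_r2 = r_squared(y_test, y_pred)
print(f"slope={slope:.3f}, out-of-sample R^2={test_r2:.3f}")
```

The imputation rule is fitted on the training split and applied as-is at test time, so the test rows never influence the fill value.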

Paper Structure

This paper contains 30 sections, 3 theorems, 17 equations, 6 figures, 6 tables.

Key Result

Theorem 2.1

Consider a learning algorithm that is universally consistent when trained on any fully observed dataset. Systematically imputing $\mu(\bm{x}_{2:d})$ for $X_1 | \bm{X}_{2:d} = \bm{x}_{2:d}$ on the training set and training a predictor on the imputed dataset leads, in the limit with infinite data, to the following [...] if $x_1 \neq \mu(\bm{x}_{2:d})$ and [...] otherwise, where [...]

Figures (6)

  • Figure 1: Average out-of-sample $R^2$ of mice-then-regress and mean-impute-then-regress on fully synthetic data, as the proportion of missing entries increases. Results are averaged over 50 different sample sizes and 10 training/test splits.
  • Figure 2: Out-of-sample $R^2$ of mice-then-regress and mean-impute-then-regress on synthetic data with non-linear signal, NMAR data, and 40% of missing entries, as the number of samples $n$ increases. We report the performance of two different downstream predictors: linear (LASSO) regression and random forest. Results are averaged over 10 training/test splits.
  • Figure 3: Average out-of-sample $R^2$ of XGBoost with mean impute, mice, or no imputation method, on synthetic data with non-linear signal, NMAR missing data, and 40% of missing entries, as the number of samples $n$ increases. Results are averaged over 10 training/test splits.
  • Figure 3.1: Graphical representation of the 4 experimental designs implemented in our benchmark simulations with real-world design matrix $\bm{X}$. Solid (resp. dashed) lines correspond to correlations explicitly (resp. not explicitly) controlled in our experiments.
  • Figure 5.1: Difference in out-of-sample $R^2$ between mice- and mean-impute-then-regress on fully synthetic data, as the proportion of missing entries and the sample size vary. A green/positive value indicates that mice is more accurate.
  • ...and 1 more figure

Theorems & Definitions (5)

  • Theorem 2.1
  • Corollary 2.2
  • Theorem 4.1
  • Example 4.2: Prediction benefits from NMAR
  • Example 4.3: Prediction benefits from MAR
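The TL;DR notes that encoding missingness as information can be beneficial, and Example 4.2 concerns prediction benefiting from NMAR data. The toy simulation below is a hypothetical illustration of that idea, not a reproduction of the paper's examples. It assumes a self-masking mechanism (the feature is missing whenever it exceeds a threshold) and compares a plain mean-impute-then-linear-regression pipeline with one that also uses a missingness indicator. All names, thresholds, and distributions are assumptions made for this sketch.

```python
import random
import statistics


def fit_simple_ols(x, y):
    """Closed-form least squares for y ~ intercept + slope * x."""
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    y_bar = statistics.fmean(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - y_bar) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot


def make_nmar_data(n, threshold, rng):
    """Hypothetical NMAR data: x is hidden exactly when x > threshold."""
    x, y = [], []
    for _ in range(n):
        xi = rng.gauss(0.0, 1.0)
        y.append(2.0 * xi + rng.gauss(0.0, 0.5))
        x.append(None if xi > threshold else xi)
    return x, y


rng = random.Random(1)
x_train, y_train = make_nmar_data(2000, 1.0, rng)
x_test, y_test = make_nmar_data(1000, 1.0, rng)

obs_mean = statistics.fmean([v for v in x_train if v is not None])

# Pipeline A: mean-impute, then fit one line; missingness is not encoded.
x_train_imp = [obs_mean if v is None else v for v in x_train]
a0, a1 = fit_simple_ols(x_train_imp, y_train)
pred_plain = [a0 + a1 * (obs_mean if v is None else v) for v in x_test]

# Pipeline B: mean-impute plus a missingness indicator. Because the imputed
# value is constant on missing rows, least squares on
# [1, imputed x, indicator] separates into a line fitted on observed rows
# and a constant (the mean training target) on missing rows.
obs_x = [v for v in x_train if v is not None]
obs_y = [t for v, t in zip(x_train, y_train) if v is not None]
b0, b1 = fit_simple_ols(obs_x, obs_y)
y_missing_mean = statistics.fmean(
    [t for v, t in zip(x_train, y_train) if v is None]
)
pred_indicator = [y_missing_mean if v is None else b0 + b1 * v for v in x_test]

r2_plain = r_squared(y_test, pred_plain)
r2_indicator = r_squared(y_test, pred_indicator)
print(f"plain R^2={r2_plain:.3f}, with indicator R^2={r2_indicator:.3f}")
```

In this setup the fact that a value is missing reveals that it was large, so a model that can react to missingness predicts those rows far better than a single line fitted on the imputed column. A more flexible downstream learner could pick up the same signal from the imputed value itself, which is one way to read the paper's remark that crude imputation can be good for prediction.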