Simple Imputation Rules for Prediction with Missing Data: Contrasting Theoretical Guarantees with Empirical Performance
Dimitris Bertsimas, Arthur Delarue, Jean Pauphilet
TL;DR
This paper investigates prediction with missing data by analyzing impute-then-regress pipelines. It proves asymptotic consistency for a broad class of simple imputation rules and shows that mean imputation is asymptotically optimal for continuous features while mode imputation is suboptimal for categorical features. Empirically, the theory largely holds on synthetic and semi-real data, but real data reveal gaps influenced by missingness mechanisms and downstream model choice. The work emphasizes that encoding missingness as information can be beneficial, questions MAR-based guarantees for prediction, and calls for more realistic data-generation models to evaluate imputation strategies.
Abstract
Missing data is a common issue in real-world datasets. This paper studies the performance of impute-then-regress pipelines by contrasting theoretical and empirical evidence. We establish the asymptotic consistency of such pipelines for a broad family of imputation methods. While common sense suggests that a "good" imputation method produces datasets that are plausible, we show, on the contrary, that, as far as prediction is concerned, crude can be good. Among other results, we find that mode-impute is asymptotically sub-optimal, while mean-impute is asymptotically optimal. We then exhaustively assess the validity of these theoretical conclusions on a large corpus of synthetic, semi-real, and real datasets. While the empirical evidence we collect mostly supports our theoretical findings, it also highlights gaps between theory and practice and opportunities for future research, regarding the relevance of the MAR assumption, the complex interdependency between the imputation and regression tasks, and the need for realistic synthetic data generation models.
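
To make the term concrete, an impute-then-regress pipeline with mean imputation can be sketched as follows. This is only an illustration under simplifying assumptions, not the authors' implementation: it uses a single continuous feature, ordinary least squares as the downstream model, and a small made-up dataset. The names `mean_impute`, `fit_line`, and `predict` are hypothetical.

```python
from statistics import mean

def mean_impute(column):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column], fill

def fit_line(x, y):
    """Ordinary least squares for y = a + b * x with a single feature."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Made-up toy data (an assumption for illustration); None marks a missing value.
x_raw = [1.0, None, 3.0, 4.0, None, 6.0]
y = [2.0, 7.5, 6.0, 8.0, 7.0, 12.0]

# Step 1 (impute): fill missing entries with the training mean.
x_imputed, fill_value = mean_impute(x_raw)

# Step 2 (regress): fit the downstream model on the imputed data.
intercept, slope = fit_line(x_imputed, y)

def predict(x_new):
    """Apply the same imputation rule at prediction time, then the fitted model."""
    x_filled = fill_value if x_new is None else x_new
    return intercept + slope * x_filled

print(predict(None), predict(5.0))
```

The fill value learned on the training data is reused at prediction time, so a new observation with a missing feature passes through the same rule as the training rows.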
