Beyond Accuracy: An Empirical Study of Uncertainty Estimation in Imputation

Zarin Tahia Hossain, Mostafa Milani

TL;DR

This study tackles uncertainty estimation in data imputation by empirically evaluating six representative imputers (MICE, SoftImpute, OT-Impute, GAIN, MIWAE, TabCSDI) under MCAR, MAR, and MNAR missingness. Uncertainty is extracted via multi-run variability, conditional sampling, and predictive-distribution modeling, and is evaluated with calibration curves and the Expected Calibration Error ($ECE$), alongside the mean absolute error ($MAE$) for point accuracy. Findings reveal a persistent misalignment between reconstruction accuracy and uncertainty calibration: some high-accuracy methods yield poorly calibrated uncertainty, while others provide better-calibrated predictions at the cost of point accuracy. The results offer practical guidance on selecting uncertainty-aware imputers according to whether an application prioritizes reliability or speed, and highlight the need for calibrated uncertainty in data cleaning and downstream ML pipelines. This work establishes a unified framework for comparing uncertainty in imputation across diverse methodological families and missingness regimes.
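The TL;DR does not spell out how ECE is computed for continuous imputed values. A common regression-style variant, assumed here and not confirmed as the paper's exact definition, works from samples: for each nominal level p, measure the fraction of true (held-out) values that fall inside the central p-interval of each missing cell's samples (this gives the calibration curve), then average |observed - p| over the levels. The helper names and the choice of the sample mean as the point estimate for MAE are illustrative assumptions.

```python
import statistics
from typing import List, Sequence


def empirical_quantile(sorted_xs: Sequence[float], q: float) -> float:
    """Linearly interpolated q-quantile (0 <= q <= 1) of an already sorted sequence."""
    n = len(sorted_xs)
    pos = q * (n - 1)
    lo = int(pos)
    hi = min(lo + 1, n - 1)
    return sorted_xs[lo] + (pos - lo) * (sorted_xs[hi] - sorted_xs[lo])


def calibration_curve(samples: Sequence[Sequence[float]], truths: Sequence[float],
                      levels: Sequence[float]) -> List[float]:
    """For each nominal level p, the observed share of truths inside each cell's central p-interval."""
    sorted_samples = [sorted(s) for s in samples]
    observed = []
    for p in levels:
        lo_q, hi_q = (1 - p) / 2, (1 + p) / 2
        hits = sum(
            1
            for s, y in zip(sorted_samples, truths)
            if empirical_quantile(s, lo_q) <= y <= empirical_quantile(s, hi_q)
        )
        observed.append(hits / len(truths))
    return observed


def expected_calibration_error(samples, truths, levels) -> float:
    """Mean absolute gap between observed and nominal coverage (assumed ECE variant)."""
    observed = calibration_curve(samples, truths, levels)
    return sum(abs(o - p) for o, p in zip(observed, levels)) / len(levels)


def mae(samples, truths) -> float:
    """Mean absolute error of the per-cell sample mean (assumed point estimate)."""
    return sum(abs(statistics.fmean(s) - y) for s, y in zip(samples, truths)) / len(truths)


if __name__ == "__main__":
    # Four missing cells, each with five imputed samples, and their true values.
    samples = [[0.0, 1.0, 2.0, 3.0, 4.0]] * 4
    truths = [2.0, 3.5, 10.0, 0.5]
    levels = [0.5, 0.9]
    print(calibration_curve(samples, truths, levels))
    print(expected_calibration_error(samples, truths, levels), mae(samples, truths))
```

A perfectly calibrated imputer would have observed coverage equal to each nominal level, so the curve would lie on the diagonal and the ECE would be zero; a low MAE says nothing about where that curve lies, which is the misalignment the study reports.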

Abstract

Handling missing data is a central challenge in data-driven analysis. Modern imputation methods aim for accurate reconstruction, but they also differ in how they represent and quantify uncertainty, and the reliability and calibration of these uncertainty estimates remain poorly understood. This paper presents a systematic empirical study of uncertainty in imputation, comparing representative methods from three major families: statistical (MICE, SoftImpute), distribution alignment (OT-Impute), and deep generative (GAIN, MIWAE, TabCSDI). Experiments span multiple datasets, missingness mechanisms (MCAR, MAR, MNAR), and missingness rates. Uncertainty is estimated through three complementary routes (multi-run variability, conditional sampling, and predictive-distribution modeling) and evaluated using calibration curves and the Expected Calibration Error (ECE). Results show that accuracy and calibration are often misaligned: models with high reconstruction accuracy do not necessarily yield reliable uncertainty. We analyze method-specific trade-offs among accuracy, calibration, and runtime, identify stable configurations, and offer guidelines for selecting uncertainty-aware imputers in data cleaning and downstream machine learning pipelines.
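The abstract names three missingness mechanisms and varies the missingness rate, but does not describe how masks are generated. The sketch below illustrates the difference between the mechanisms under a hypothetical median-split scheme (not necessarily the paper's protocol): MCAR hides cells uniformly at random, MAR makes hiding depend on an always-observed driver column, and MNAR makes it depend on the hidden value itself. The function names, the 3:1 high/low probability ratio, and the `driver_col` parameter are all illustrative assumptions.

```python
import random
import statistics
from typing import List, Sequence

Mask = List[List[bool]]  # mask[i][j] is True when cell (i, j) is hidden from the imputer


def mcar_mask(n_rows: int, n_cols: int, rate: float, rng: random.Random) -> Mask:
    """MCAR: every cell is hidden independently with the same probability `rate`."""
    return [[rng.random() < rate for _ in range(n_cols)] for _ in range(n_rows)]


def _hide_prob(is_high: bool, rate: float) -> float:
    # Hypothetical scheme: cells on the "high" side of a median split are three times
    # as likely to be hidden as cells on the "low" side, so the expected overall rate
    # stays near `rate` as long as 1.5 * rate <= 1.
    return min(1.0, 1.5 * rate if is_high else 0.5 * rate)


def mar_mask(data: Sequence[Sequence[float]], rate: float, driver_col: int,
             rng: random.Random) -> Mask:
    """MAR: missingness depends only on `driver_col`, which is itself never hidden.

    Because the driver column stays observed, the overall missing share is
    roughly rate * (n_cols - 1) / n_cols.
    """
    med = statistics.median(row[driver_col] for row in data)
    mask = []
    for row in data:
        p = _hide_prob(row[driver_col] > med, rate)
        mask.append([j != driver_col and rng.random() < p for j in range(len(row))])
    return mask


def mnar_mask(data: Sequence[Sequence[float]], rate: float, rng: random.Random) -> Mask:
    """MNAR: each cell's chance of being hidden depends on its own (unobserved) value."""
    n_cols = len(data[0])
    meds = [statistics.median(row[j] for row in data) for j in range(n_cols)]
    return [
        [rng.random() < _hide_prob(row[j] > meds[j], rate) for j in range(n_cols)]
        for row in data
    ]
```

For instance, `mcar_mask(n, d, 0.3, random.Random(0))` approximates a 30% MCAR setting; fixing the seed keeps the same cells hidden for every imputer being compared, so differences in MAE or ECE reflect the methods rather than the masks.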

Paper Structure

This paper contains 16 sections, 7 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: ECE vs. n-runs at 30% MCAR.
  • Figure 2: ECE vs. number of samples at 30% MCAR.
  • Figure 3: Time per run at 30% missingness vs. … Classical methods report total time; deep models report training plus single-imputation time.
  • Figure 4: Runtime vs. n-samples at 30% MCAR in wine.
  • Figure 5: MAE vs. missing rate for MCAR.
  • ...and 4 more figures
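Figures 1 and 2 track ECE as the number of runs and samples grows. As a hedged illustration of the multi-run route (not the paper's procedure), the sketch below treats the spread of R repeated imputations of each missing cell as its predictive distribution and, assuming that distribution is Gaussian, checks how often the true value falls inside the central interval at a chosen level. The names `aggregate_runs` and `gaussian_coverage` are hypothetical, and `runs[r][k]` is assumed to hold the value imputed for missing cell k in run r.

```python
import statistics
from typing import List, Sequence, Tuple


def aggregate_runs(runs: Sequence[Sequence[float]]) -> List[Tuple[float, float]]:
    """Per missing cell: (mean, sample standard deviation) across repeated runs."""
    if len(runs) < 2:
        raise ValueError("need at least two runs to estimate spread")
    per_cell = list(zip(*runs))  # regroup values by cell instead of by run
    return [(statistics.fmean(v), statistics.stdev(v)) for v in per_cell]


def gaussian_coverage(stats: Sequence[Tuple[float, float]], truths: Sequence[float],
                      level: float) -> float:
    """Share of truths inside mean +/- z * std, assuming a Gaussian predictive distribution."""
    z = statistics.NormalDist().inv_cdf((1 + level) / 2)
    hits = sum(1 for (m, s), y in zip(stats, truths) if abs(y - m) <= z * s)
    return hits / len(truths)


if __name__ == "__main__":
    # Three runs of a (hypothetical) imputer over the same two missing cells.
    runs = [[1.0, 10.0], [3.0, 10.2], [2.0, 9.8]]
    stats = aggregate_runs(runs)
    print(stats, gaussian_coverage(stats, [2.5, 11.0], level=0.9))
```

With only a handful of runs the per-cell standard deviation is itself a noisy estimate, so coverage and ECE computed this way can shift as runs are added; the sketch only shows the mechanics and does not reproduce the reported results.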