Learning Accurate Models on Incomplete Data with Minimal Imputation

Cheng Zhen, Nischal Aryal, Arash Termehchy, Prayoga, Garrett Biwer, Sankalp Patil

TL;DR

This work addresses learning from data with missing values by introducing minimal imputation, the smallest set of missing entries whose imputation preserves the model learned from fully imputed data. It provides formal definitions and NP-hardness proofs for exact minimal imputation in both SVM and linear regression, and offers exact and practical approximate algorithms leveraging edge repairs and OMP-style feature selection, with incremental update strategies. The proposed methods reduce data-cleaning costs while maintaining downstream predictive accuracy, as demonstrated on diverse real-world datasets and against multiple baselines, including ActiveClean and common imputation approaches. Together, these results highlight the practical value of imputing only the truly necessary missing data to achieve reliable models in real-world incomplete datasets.

Abstract

Missing data often exists in real-world datasets, requiring significant time and effort for imputation to learn accurate machine learning (ML) models. In this paper, we demonstrate that imputing all missing values is not always necessary to achieve an accurate ML model. We introduce the concept of minimal data imputation, which ensures accurate ML models trained over the imputed dataset. Implementing minimal imputation guarantees both minimal imputation effort and optimal ML models. We propose algorithms to find exact and approximate minimal imputation for various ML models. Our extensive experiments indicate that our proposed algorithms significantly reduce the time and effort required for data imputation.
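
To make the idea concrete, the sketch below shows one situation in which an incomplete row can be left unimputed for a linear soft-margin SVM. It is an illustration only, not the paper's algorithm. It assumes a model `(w, b)` already fitted on the complete rows and a known value range for each feature; the names `worst_case_margin`, `row_needs_imputation`, and `bounds` are hypothetical. If the row has zero hinge loss under every possible filling of its missing entries, adding it to the training set cannot change the optimal model, so imputing it is unnecessary. The check is sufficient, not necessary.

```python
from typing import Optional, Sequence, Tuple


def worst_case_margin(
    w: Sequence[float],
    b: float,
    x: Sequence[Optional[float]],
    y: int,
    bounds: Sequence[Tuple[float, float]],
) -> float:
    """Smallest value of y * (w . x + b) over all fillings of missing entries.

    Missing entries are marked with None; bounds[j] = (lo, hi) is the assumed
    range of feature j. The margin is linear in each missing entry, so its
    minimum is reached at one of the two endpoints of that entry's range.
    """
    margin = y * b
    for w_j, x_j, (lo, hi) in zip(w, x, bounds):
        if x_j is None:
            # Take the endpoint that reduces the margin the most.
            margin += min(y * w_j * lo, y * w_j * hi)
        else:
            margin += y * w_j * x_j
    return margin


def row_needs_imputation(w, b, x, y, bounds) -> bool:
    """True if some filling could give the row a positive hinge loss."""
    return worst_case_margin(w, b, x, y, bounds) < 1.0


# Hypothetical model and feature ranges, chosen only for illustration.
w, b = [2.0, 0.5], -1.0
bounds = [(0.0, 5.0), (-1.0, 1.0)]

far_row = [3.0, None]    # far from the decision boundary
near_row = [1.2, None]   # close to the decision boundary

print(row_needs_imputation(w, b, far_row, +1, bounds))   # False
print(row_needs_imputation(w, b, near_row, +1, bounds))  # True
```

In this toy setting only `near_row` would be sent for imputation; `far_row` stays outside the margin whatever value its missing feature takes.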

Paper Structure

This paper contains 53 sections, 10 theorems, 13 equations, 1 figure, 6 tables, 2 algorithms.

Key Result

Theorem 1

(Uniqueness of the minimal imputation set) Given the training set $(\mathbf{X}, \mathbf{y})$ and regularization parameter $C$, $\mathbf{S}_{min}(\mathbf{X}, \mathbf{y},C)$ is unique.

Figures (1)

  • Figure 1: The necessity of missing data imputation varies

Theorems & Definitions (18)

  • Example 1
  • Definition 1
  • Example 2
  • Example 3
  • Definition 2
  • Definition 3
  • Theorem 1
  • Theorem 2
  • Definition 4
  • Theorem 3
  • ...and 8 more