
Imputation using training labels and classification via label imputation

Thu Nguyen, Tuan L. Vo, Pål Halvorsen, Michael A. Riegler

TL;DR

The paper tackles missing data by exploiting training labels during imputation. It introduces CBMI, which predicts test labels by jointly imputing stacked training and test data with MissForest, and IUL, which augments inputs with labels to improve imputation quality and can be paired with any imputation method. Empirical results show CBMI often improves classification accuracy, especially on imbalanced or categorical data, while IUL yields lower MSE and better classification as missingness increases. Limitations include the pre-collected-data setting and MNAR scenarios, with future work aiming to extend these ideas to semi-supervised learning and other imputers.

Abstract

Missing data is a common problem in practical data science settings, and various imputation methods have been developed to deal with it. However, even though labels are available in the training data in many situations, common imputation practice relies only on the input and ignores the label. We propose Classification Based on MissForest Imputation (CBMI), a classification strategy that initializes the predicted test label with missing values and stacks the label with the input for imputation, allowing the label and the input to be imputed simultaneously. In addition, we propose the imputation using labels (IUL) algorithm, an imputation strategy that stacks the label into the input, and illustrate how it can significantly improve imputation quality. Experiments show that CBMI improves classification accuracy when the test set contains missing data, especially for imbalanced data and categorical data. Moreover, for both regression and classification, IUL consistently shows significantly better results than imputation based on only the input data.
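
The two stacking ideas described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes scikit-learn's `IterativeImputer` with a `RandomForestRegressor` as a rough stand-in for MissForest, and the helper names `make_imputer`, `impute_using_labels`, and `classify_by_imputation` are hypothetical. Because the stand-in treats the label as a numeric column, imputed labels are snapped to the nearest observed class, which is also an assumption of this sketch.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor


def make_imputer():
    """Random-forest-based iterative imputer (an assumed stand-in for MissForest)."""
    return IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=20, random_state=0),
        max_iter=5,
        random_state=0,
    )


def impute_using_labels(X_train, y_train, imputer):
    """IUL idea: append the label as an extra column, impute, then drop it."""
    stacked = np.column_stack([X_train, y_train])
    filled = imputer.fit_transform(stacked)
    return filled[:, :-1]


def classify_by_imputation(X_train, y_train, X_test, imputer):
    """CBMI idea: initialize test labels as missing and let the imputer fill them."""
    y_test_init = np.full(X_test.shape[0], np.nan)
    stacked = np.vstack([
        np.column_stack([X_train, y_train]),
        np.column_stack([X_test, y_test_init]),
    ])
    filled = imputer.fit_transform(stacked)
    y_hat = filled[X_train.shape[0]:, -1]
    # The regressor returns real values; map each one to the nearest training class.
    classes = np.unique(y_train)
    nearest = np.abs(y_hat[:, None] - classes[None, :]).argmin(axis=1)
    return classes[nearest]


# Synthetic demo data (assumed, for illustration only).
rng = np.random.default_rng(0)
y = (np.arange(120) % 2).astype(float)
X = rng.normal(size=(120, 3)) + 2.0 * y[:, None]
X[rng.random(X.shape) < 0.2] = np.nan  # entries removed completely at random

X_train, X_test, y_train = X[:90], X[90:], y[:90]
X_train_imputed = impute_using_labels(X_train, y_train, make_imputer())
y_test_pred = classify_by_imputation(X_train, y_train, X_test, make_imputer())
print(X_train_imputed.shape, y_test_pred[:5])
```

In both helpers the label column enters the imputer like any other feature, which is what lets the imputation of the inputs draw on the label and, for the test rows, lets the label itself be imputed.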

Paper Structure

This paper contains 11 sections, 1 theorem, 6 equations, 4 figures, 3 tables, 2 algorithms.

Key Result

Theorem 1

Assume that we have a dataset $\mathcal{D}=(\mathbf{x},\mathbf{z},\mathbf{y})$ of $n$ samples. Here, $\mathbf{x}$ contains missing values, $\mathbf{z}$ is fully observed, and $\mathbf{y}$ is a label feature. For MICE imputation, with the IUL strategy, we construct the model $\hat{x}=\hat{\gamma}_o+\cdots$ [...] Here, the value of $\mathcal{E}_i$ could be negative or non-negative. Thus, we distinguish between [...]
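
The statement above is cut off in this listing, so the following sketch does not reproduce the theorem's conclusion. It only illustrates a standard least-squares fact relevant to the setup: when the regression used to impute $\mathbf{x}$ includes the label $\mathbf{y}$ as an additional regressor alongside $\mathbf{z}$, its in-sample sum of squared errors cannot exceed that of the regression on $\mathbf{z}$ alone. The data-generating coefficients and the helper name `sse_of_fit` are assumptions made for illustration.

```python
import numpy as np


def sse_of_fit(design, target):
    """Fit ordinary least squares and return the in-sample sum of squared errors."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residual = target - design @ coef
    return float(residual @ residual)


# Synthetic data (assumed): x depends on both the observed feature z and the label y.
rng = np.random.default_rng(0)
n = 200
z = rng.normal(size=n)
y = rng.integers(0, 2, size=n).astype(float)
x = 1.0 + 0.5 * z + 2.0 * y + rng.normal(scale=0.5, size=n)

ones = np.ones(n)
sse_without_label = sse_of_fit(np.column_stack([ones, z]), x)   # regress x on z only
sse_with_label = sse_of_fit(np.column_stack([ones, z, y]), x)   # regress x on z and y

print(f"SSE without label: {sse_without_label:.3f}")
print(f"SSE with label:    {sse_with_label:.3f}")
```

The inequality holds for any data because the smaller model is nested in the larger one. How much the error drops depends on how informative the label is about $\mathbf{x}$.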

Figures (4)

  • Figure 1: Performance of IUL compared to DI with missForest for classification tasks.
  • Figure 2: Performance of IUL compared to DI with missForest for regression tasks.
  • Figure 3: Performance of IUL compared to DI with MICE for classification tasks.
  • Figure 4: Performance of IUL compared to DI with MICE for regression tasks.

Theorems & Definitions (1)

  • Theorem 1