Iterative missing value imputation based on feature importance

Cong Guo, Chun Liu, Wei Yang

TL;DR

The paper tackles missing-value imputation by incorporating feature importance into the imputation process. It introduces Iterative Weighted Matrix Completion (IWMC), which alternates matrix completion (M-stage) with feature-weight learning (W-stage), using a weighted matrix factorization loss and NCFS-based weight updates. The key contributions are a formal IWMC framework, closed-form updates for the M-stage, and NCFS-driven weighting in the W-stage, validated on synthetic and real-world datasets where IWMC consistently outperforms five baselines in downstream tasks. This approach advances data preprocessing by aligning imputation with feature relevance, improving downstream feature selection and classification and offering a practical tool for high-dimensional data with missing values.

Abstract

Many datasets suffer from missing values for various reasons, which not only makes related tasks harder to process but also reduces classification accuracy. To address this problem, the mainstream approach is to use missing value imputation to complete the dataset. Existing imputation methods estimate the missing parts based on the observed values in the original feature space, and they treat all features as equally important during data completion, while in fact different features have different importance. Therefore, we have designed an imputation method that considers feature importance. This algorithm iteratively performs matrix completion and feature importance learning; specifically, matrix completion is based on a filling loss that incorporates feature importance. Our experimental analysis involves three types of datasets: synthetic datasets with different noisy features and missing values, real-world datasets with artificially generated missing values, and real-world datasets originally containing missing values. The results on these datasets consistently show that the proposed method outperforms five existing imputation algorithms. To the best of our knowledge, this is the first work that considers feature importance in the imputation model.
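
The alternation described above, matrix completion under a feature-weighted loss followed by a feature-importance update, can be illustrated with a small NumPy sketch. This is not the authors' implementation. The function names (`m_stage`, `w_stage`, `iwmc_sketch`), the rank, the regularization strength `lam`, and the weight floor are all assumptions made for illustration. The completion step uses standard weighted alternating least squares, which may differ from the paper's closed-form updates, and the weight step uses a Fisher score as a simple stand-in for NCFS.

```python
import numpy as np

def m_stage(X, mask, w, rank=2, lam=0.1, n_iters=20, seed=0):
    """Weighted ALS (assumed form of the completion step).

    Minimises  sum over observed (i, j) of w[j] * (X[i, j] - U[i] @ V[j])**2
               + lam * (||U||^2 + ||V||^2).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    U = rng.normal(scale=0.1, size=(n, rank))
    V = rng.normal(scale=0.1, size=(d, rank))
    Xz = np.where(mask, X, 0.0)      # zero out missing entries (NaN-safe)
    reg = lam * np.eye(rank)
    for _ in range(n_iters):
        for i in range(n):           # ridge solve for each row factor
            c = mask[i] * w          # weight of each observed entry in row i
            A = (V * c[:, None]).T @ V + reg
            U[i] = np.linalg.solve(A, V.T @ (c * Xz[i]))
        for j in range(d):           # ridge solve for each column factor
            r = mask[:, j] * w[j]
            A = (U * r[:, None]).T @ U + reg
            V[j] = np.linalg.solve(A, U.T @ (r * Xz[:, j]))
    # Keep observed values, fill only the missing ones.
    return np.where(mask, X, U @ V.T)

def w_stage(X_full, y, floor=1e-3, eps=1e-12):
    """Stand-in for NCFS (assumption): Fisher score per feature, mean-normalised."""
    mu = X_full.mean(axis=0)
    between = np.zeros(X_full.shape[1])
    within = np.zeros(X_full.shape[1])
    for c in np.unique(y):
        Xc = X_full[y == c]
        between += len(Xc) * (Xc.mean(axis=0) - mu) ** 2
        within += len(Xc) * Xc.var(axis=0)
    score = between / (within + eps)
    w = score * len(score) / (score.sum() + eps)
    return np.maximum(w, floor)      # floor keeps every feature in the loss

def iwmc_sketch(X, y, n_rounds=5, **m_kwargs):
    """Alternate completion and weight learning; NaN marks a missing entry."""
    mask = ~np.isnan(X)
    w = np.ones(X.shape[1])          # start with equally important features
    for _ in range(n_rounds):
        X_full = m_stage(X, mask, w, **m_kwargs)
        w = w_stage(X_full, y)
    return X_full, w
```

Calling `iwmc_sketch(X, y)` on a NumPy array `X` with `np.nan` at the missing positions and a label vector `y` returns the completed matrix and the learned feature weights. Observed entries are passed through unchanged; only the missing ones are filled from the low-rank reconstruction.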

Paper Structure (18 sections, 18 equations, 8 figures, 7 tables, 1 algorithm)

Figures (8)

  • Figure 1: Success rate of relevant features selected on the synthetic dataset with 5% of missing data.
  • Figure 2: Success rate of relevant features selected on the synthetic dataset with 20% of missing data.
  • Figure 3: The standard deviation of the success rate of relevant features selected on the synthetic dataset with 5% of missing data.
  • Figure 4: The standard deviation of the success rate of relevant features selected on the synthetic dataset with 20% of missing data.
  • Figure 5: Average success rate of relevant features selected on the synthetic dataset with MCAR data.
  • ...and 3 more figures