
ITI-IQA: a Toolbox for Heterogeneous Univariate and Multivariate Missing Data Imputation Quality Assessment

Pedro Pons-Suñer, Laura Arnal, J. Ramón Navarro-Cerdán, François Signol

TL;DR

ITI-IQA introduces a bias-aware, trainable toolbox for imputing missing data. It evaluates univariate and multivariate imputers on a per-feature basis and combines completeness with imputation quality into a joint score $\omega_x$. The framework uses bias-detection tests (Kolmogorov-Smirnov (KS) and Chi-square), a novel APPRandom imputer, and dependency graphs to optimize multivariate imputations, and it offers configurable pipelines and visualization tools. Empirical results on the UCI Heart Disease dataset show that IQA can produce imputations that are harder to distinguish from observed data than those of MICE and KNN, highlighting its strength in reducing bias while keeping predictive performance competitive. Overall, IQA provides a flexible, transparent approach to missing-data preprocessing that emphasizes reliability and bias mitigation, with potential for broad adoption as a modular Python library.

Abstract

Missing values are a major challenge in most data science projects that work with real data. To avoid losing valuable information, imputation methods fill in missing values with estimates, preserving samples or variables that would otherwise be discarded. However, if the process is not well controlled, imputation can generate spurious values that introduce uncertainty and bias into the learning process. The abundance of univariate and multivariate imputation techniques, along with the complex trade-off between data reliability and preservation, makes it difficult to determine the best way to handle missing values. In this work, we present ITI-IQA (Imputation Quality Assessment), a set of utilities designed to assess the reliability of various imputation methods, select the best imputer for any feature or group of features, and filter out features that do not meet quality criteria. Statistical tests are conducted to evaluate the suitability of every tested imputer, ensuring that no new biases are introduced during the imputation phase. The result is a trainable pipeline of filters and imputation methods that streamlines the process of dealing with missing data and supports different data types: continuous, discrete, binary, and categorical. The toolbox also includes a suite of diagnostic methods and graphical tools to check measurements and results during and after handling missing data.
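
The idea of statistically testing whether an imputer introduces bias can be illustrated with a minimal sketch. It is not the ITI-IQA API: the function name `ks_two_sample`, the synthetic data, and the two toy imputers are assumptions made for illustration only. It covers just the continuous case with a two-sample KS test written in pure Python (in practice `scipy.stats.ks_2samp` would be the usual choice); categorical features would call for a Chi-square test instead.

```python
import math
import random
from bisect import bisect_right


def ks_two_sample(a, b):
    """Return (D, approximate p-value) for a two-sample KS test.

    Hypothetical helper for illustration; not part of ITI-IQA.
    """
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    # D is the largest gap between the two empirical CDFs; it can only
    # be attained at one of the observed data points.
    d = max(
        abs(bisect_right(a, x) / n - bisect_right(b, x) / m)
        for x in a + b
    )
    # Asymptotic Kolmogorov tail with a small-sample correction.
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.2:
        # The series below converges poorly near zero; the tail
        # probability is effectively 1 in this region.
        return d, 1.0
    p = 2.0 * sum(
        (-1) ** (j - 1) * math.exp(-2.0 * (j * lam) ** 2)
        for j in range(1, 101)
    )
    return d, min(max(p, 0.0), 1.0)


# Synthetic "observed" values of one continuous feature (assumption).
rng = random.Random(0)
observed = [rng.gauss(50.0, 10.0) for _ in range(300)]

# Two toy imputers filling 100 missing entries (both hypothetical):
# a constant mean imputer, and random sampling from observed values.
mean_value = sum(observed) / len(observed)
mean_imputed = [mean_value] * 100
sampled_imputed = [rng.choice(observed) for _ in range(100)]

d_mean, p_mean = ks_two_sample(observed, mean_imputed)
d_samp, p_samp = ks_two_sample(observed, sampled_imputed)

print(f"mean imputer:     D={d_mean:.3f}, p={p_mean:.3g}")
print(f"sampling imputer: D={d_samp:.3f}, p={p_samp:.3g}")
```

A small p-value means the imputed values are distinguishable from the observed ones, which is the kind of evidence that would count against an imputer. The constant mean imputer collapses the distribution to a single point and is flagged clearly, whereas values resampled from the observed data follow the observed distribution much more closely.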

Paper Structure (20 sections, 3 equations, 8 figures, 8 tables, 1 algorithm)

This paper contains 20 sections, 3 equations, 8 figures, 8 tables, 1 algorithm.

Figures (8)

  • Figure 1: Dependency graph containing four features: A, B, C, D. An arrow from A to B indicates that B is dependent on A, or that A is important for predicting B.
  • Figure 2: Example of configuration JSON file for IQA.
  • Figure 3: Example of IQA final quality results. A quality threshold of 0.9 has been set, so col6 and col7 would be removed while, in principle, the rest are accepted. col8, which is a constant variable, can be perfectly imputed despite having been assigned an APPRandom imputer, which is indicated by its red color.
  • Figure 4: Distribution of target values, from 0 to 4, where a value of 0 indicates that the patient has no heart disease.
  • Figure 5: Matrix distribution of missing (blank) and observed (gray) values across the UCI Heart Disease dataset.
  • ...and 3 more figures