Explainability of Machine Learning Models under Missing Data

Tuan L. Vo, Thu Nguyen, Luis M. Lopez-Ramos, Hugo L. Hammer, Michael A. Riegler, Pal Halvorsen

TL;DR

This work addresses the interplay between missing data handling and explainability by studying how imputation methods affect SHAP-based attributions. It combines theoretical analysis under the MCAR (missing completely at random) assumption with large-scale experiments across regression and classification tasks, comparing six imputation methods and XGBoost's built-in handling of missing values. The findings show that the choice of imputation method can substantially alter Shapley values, and that the methods with the best predictive MSE do not always preserve the original feature importance structure, underscoring the need to tailor imputation to the goals of the analysis. Practically, the paper gives practitioners guidance on selecting imputation strategies compatible with their data characteristics and explainability objectives, and highlights the potential of DIMV and similar approaches to better preserve explanations.

Abstract

Missing data is a prevalent issue that can significantly impair both model performance and explainability. This paper briefly summarizes the development of the field of missing data with respect to Explainable Artificial Intelligence and experimentally investigates the effects of various imputation methods on SHAP (SHapley Additive exPlanations), a popular technique for explaining the output of complex machine learning models. We compare different imputation strategies and assess their impact on feature importance and feature interactions as determined by Shapley values, and we theoretically analyze the effects of missing values on Shapley values. Our findings reveal that the choice of imputation method can introduce biases that alter the Shapley values, thereby affecting the explainability of the model. We also show that a lower test prediction MSE (Mean Squared Error) does not necessarily imply a lower MSE in Shapley values, and vice versa. Furthermore, although XGBoost (eXtreme Gradient Boosting) can handle missing data directly, training it directly on data with missing values can seriously affect explainability compared to imputing the data before training. This study provides a comprehensive evaluation of imputation methods in the context of model explanations, offering practical guidance for selecting appropriate techniques based on dataset characteristics and analysis objectives. The results underscore the importance of considering imputation effects to ensure robust and reliable insights from machine learning models.
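
To make the distinction between prediction MSE and MSE in Shapley values concrete, the following is a minimal sketch, not the paper's experimental pipeline. It assumes a fixed hypothetical model instead of a trained one, uses plain mean imputation as the only imputation method, and computes exact Shapley values by subset enumeration with a background-averaged value function rather than calling the SHAP library. The names model, mcar_mask, mean_impute, shapley_values and compare are all illustrative.

```python
import itertools
import math
import random


def model(x):
    # Hypothetical fixed regressor standing in for a trained model (assumption).
    return 3.0 * x[0] + 2.0 * x[1] * x[2]


def mcar_mask(data, rate, rng):
    # MCAR: every entry is dropped independently with probability `rate`.
    return [[None if rng.random() < rate else v for v in row] for row in data]


def mean_impute(data):
    # Replace each missing entry with the mean of the observed values in its column.
    means = []
    for j in range(len(data[0])):
        observed = [row[j] for row in data if row[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in data]


def shapley_values(f, x, background):
    # Exact Shapley values. v(S) averages f over the background rows,
    # with the features in S fixed to their values in x.
    d = len(x)
    cache = {}

    def value(subset):
        if subset not in cache:
            total = 0.0
            for b in background:
                total += f([x[j] if j in subset else b[j] for j in range(d)])
            cache[subset] = total / len(background)
        return cache[subset]

    phi = [0.0] * d
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for k in range(d):
            weight = math.factorial(k) * math.factorial(d - k - 1) / math.factorial(d)
            for s in itertools.combinations(others, k):
                s = frozenset(s)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def compare(rate, seed=0, n=60):
    # Returns (prediction MSE, Shapley-value MSE) between complete and imputed data.
    rng = random.Random(seed)
    data = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(n)]
    imputed = mean_impute(mcar_mask(data, rate, rng))
    pred_mse = sum((model(a) - model(b)) ** 2 for a, b in zip(data, imputed)) / n
    shap_true = [shapley_values(model, x, data) for x in data]
    shap_imp = [shapley_values(model, x, imputed) for x in imputed]
    shap_mse = sum(
        (p - q) ** 2 for u, v in zip(shap_true, shap_imp) for p, q in zip(u, v)
    ) / (n * 3)
    return pred_mse, shap_mse


if __name__ == "__main__":
    for r in (0.2, 0.4, 0.6, 0.8):
        print(r, compare(r))
```

Running the script prints both error measures at several missing rates, which are separate quantities and need not move together. Because the model here is fixed, the sketch only captures how imputation distorts the inputs and the background distribution; in the paper's setting the model is also trained on the imputed data, which adds a further source of change.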

Paper Structure

This paper contains 23 sections, 1 theorem, 23 equations, 16 figures, 3 tables, 1 algorithm.

Key Result

Theorem 1

The global feature importance of $\mathbf{x}'$ on $\mathbf{z'}$ can be simplified to

Figures (16)

  • Figure 1: Global feature importance plot on the California dataset with the missing rate $r=0.2$
  • Figure 2: Global feature importance plot on the California dataset with the missing rate $r=0.4$
  • Figure 3: Global feature importance plot on the California dataset with the missing rate $r=0.6$
  • Figure 4: Global feature importance plot on the California dataset with the missing rate $r=0.8$
  • Figure 5: Beeswarm plots for the California dataset at missing rate $r=0.2$
  • ...and 11 more figures

Theorems & Definitions (2)

  • Theorem 1
  • proof