Explainability of Machine Learning Models under Missing Data

Tuan L. Vo; Thu Nguyen; Luis M. Lopez-Ramos; Hugo L. Hammer; Michael A. Riegler; Pal Halvorsen

TL;DR

This work addresses the interplay between missing data handling and explainability by studying how imputation methods affect SHAP-based attributions. It combines theoretical analysis under MCAR with large-scale experiments across regression and classification tasks, comparing six imputation methods and XGBoost's direct-missing-data capability. The findings show that imputation choice can substantially alter Shapley values, and that methods delivering the best predictive MSE do not always preserve the original feature importance structure, underscoring the need to tailor imputation to analysis goals. Practically, the paper provides guidance for practitioners on selecting imputation strategies compatible with their data characteristics and explainability objectives, and highlights the potential of DIMV and similar approaches to better preserve explanations.

Abstract

Missing data is a prevalent issue that can significantly impair both model performance and explainability. This paper briefly summarizes how the missing data literature has developed with respect to Explainable Artificial Intelligence, and experimentally investigates the effects of various imputation methods on SHAP (SHapley Additive exPlanations), a popular technique for explaining the output of complex machine learning models. We compare different imputation strategies and assess their impact on feature importance and feature interaction as determined by Shapley values. We also analyze theoretically how missing values affect Shapley values. Our findings reveal that the choice of imputation method can introduce biases that change the Shapley values, and therefore the explanation of the model. We further show that a lower test prediction MSE (Mean Squared Error) does not necessarily imply a lower MSE in Shapley values, and vice versa. In addition, although XGBoost (eXtreme Gradient Boosting) can handle missing data directly, training it on data with missing values can seriously affect explainability compared to imputing the data before training. This study provides a comprehensive evaluation of imputation methods in the context of model explanations, offering practical guidance for selecting appropriate techniques based on dataset characteristics and analysis objectives. The results underscore the importance of accounting for imputation effects to obtain robust and reliable insights from machine learning models.
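
The two error measures contrasted in the abstract, prediction MSE and MSE in Shapley values, can be illustrated with a small toy sketch. The code below is not the paper's experimental pipeline: it assumes a fixed linear model instead of a trained one, uses plain mean imputation under MCAR, and computes exact Shapley values by subset enumeration rather than with the SHAP library. All names (`shapley_values`, `mcar_mask`, `mean_impute`, `WEIGHTS`) are hypothetical and chosen only for illustration.

```python
import itertools
import math
import random


def shapley_values(f, x, background):
    """Exact Shapley values of f at x, enumerating all feature subsets.

    Assumption: features outside a coalition are replaced by values from
    the background rows (a marginal / interventional value function).
    """
    d = len(x)

    def value(subset):
        # Mean prediction with features in `subset` taken from x
        # and all other features taken from each background row.
        total = 0.0
        for b in background:
            z = [x[j] if j in subset else b[j] for j in range(d)]
            total += f(z)
        return total / len(background)

    phi = [0.0] * d
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for k in range(d):
            weight = (math.factorial(k) * math.factorial(d - k - 1)
                      / math.factorial(d))
            for subset in itertools.combinations(others, k):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def mcar_mask(data, rate, rng):
    """Mark each entry as missing independently with probability `rate`."""
    return [[rng.random() < rate for _ in row] for row in data]


def mean_impute(data, mask):
    """Replace masked entries by the mean of the observed entries per column."""
    d = len(data[0])
    means = []
    for j in range(d):
        observed = [row[j] for row, m in zip(data, mask) if not m[j]]
        means.append(sum(observed) / len(observed))
    return [[means[j] if m[j] else row[j] for j in range(d)]
            for row, m in zip(data, mask)]


def mse(a, b):
    """Mean squared error over all entries of two equally shaped nested lists."""
    sq = [(u - v) ** 2 for ra, rb in zip(a, b) for u, v in zip(ra, rb)]
    return sum(sq) / len(sq)


# Hypothetical fixed linear model (the paper trains models; this is a toy).
WEIGHTS = [2.0, -1.0, 0.5]


def model(z):
    return sum(w * v for w, v in zip(WEIGHTS, z)) + 1.0


rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(60)]
mask = mcar_mask(X, 0.2, rng)   # missing rate r = 0.2
X_imp = mean_impute(X, mask)

# Shapley values on the complete data versus on the imputed data.
phi_full = [shapley_values(model, x, X) for x in X]
phi_imp = [shapley_values(model, x, X_imp) for x in X_imp]

shap_mse = mse(phi_full, phi_imp)
pred_mse = mse([[model(x)] for x in X], [[model(x)] for x in X_imp])
print("MSE in Shapley values:", shap_mse)
print("MSE in predictions:   ", pred_mse)
```

Swapping `mean_impute` for another imputation routine and comparing the two printed numbers across methods gives a rough feel for why ranking imputers by prediction error and ranking them by Shapley-value error need not agree.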

Paper Structure (23 sections, 1 theorem, 23 equations, 16 figures, 3 tables, 1 algorithm)

Introduction
Related works
Explainable AI
Missing data imputation techniques
Direct missing data handling techniques without imputation
Studies on the impact of imputation on model interpretation
Methods
Shapley values
Imputation techniques
Theoretical analysis
Experiments
Experiment settings
Global feature importance plot analysis
Beeswarm plot analysis
Beeswarm plot for the California dataset
...and 8 more sections

Key Result

Theorem 1

The global feature importance of $\mathbf{x}'$ on $\mathbf{z'}$ can be simplified to

Figures (16)

Figure 1: Global feature importance plot on the California dataset with the missing rate $r=0.2$
Figure 2: Global feature importance plot on the California dataset with the missing rate $r=0.4$
Figure 3: Global feature importance plot on the California dataset with the missing rate $r=0.6$
Figure 4: Global feature importance plot on the California dataset with the missing rate $r=0.8$
Figure 5: Beeswarm plots for the California dataset at missing rate $r=0.2$
...and 11 more figures

Theorems & Definitions (2)

Theorem 1
proof
