Common Steps in Machine Learning Might Hinder The Explainability Aims in Medicine

Ahmed M Salih

TL;DR

The paper investigates how standard data preprocessing steps, designed to improve ML performance, can undermine explainability in medical applications. It surveys missing-value handling, outlier treatment, data augmentation, normalization/standardization, feature selection, PCA, and confounding variables, outlining how each affects interpretability and clinical trust. It discusses practical trade-offs and proposes recommendations—such as multi-imputation evaluation, preserving clinically meaningful extremes, fairness-aware augmentation, and careful handling of confounders—to maintain explainability alongside accuracy. The work highlights the need for explainable, transparent medical AI that balances performance with interpretability and patient safety.

Abstract

Data preprocessing is a significant step in machine learning that improves model performance and decreases running time. It may include handling missing values, detecting and removing outliers, data augmentation, dimensionality reduction, data normalization, and handling the impact of confounding variables. Although these steps have been found to improve model accuracy, they might hinder the explainability of the model if they are not carefully considered, especially in medicine. They might block new findings when missing-value handling and outlier removal are implemented inappropriately. In addition, they might make the model's decisions unfair across the groups represented in the data. Moreover, they can turn the features into unitless, clinically meaningless quantities that are consequently not explainable. This paper discusses the common data preprocessing steps in machine learning and their impact on the explainability and interpretability of the model. Finally, the paper discusses some possible solutions that improve the performance of the model without decreasing its explainability.
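
As a small illustration of the point about unitless features, the sketch below standardizes a clinical measurement and then maps the results back to clinical units. It is not taken from the paper: the blood pressure readings, the model weight, and the names `sbp_mmhg`, `to_mmhg`, and `weight_per_sd` are all hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev

# Hypothetical systolic blood pressure readings in mmHg (illustrative values only).
sbp_mmhg = [110.0, 120.0, 130.0, 140.0, 150.0]

mu = mean(sbp_mmhg)       # centre of the feature, in mmHg
sigma = pstdev(sbp_mmhg)  # population standard deviation, in mmHg

# Standardization: each reading becomes a unitless z-score.
z_scores = [(x - mu) / sigma for x in sbp_mmhg]


def to_mmhg(z):
    """Map a z-score back to mmHg using the stored mean and deviation."""
    return z * sigma + mu


# A hypothetical model weight expressed per one standard deviation
# can be reported per 1 mmHg by dividing by sigma.
weight_per_sd = 0.8
weight_per_mmhg = weight_per_sd / sigma

print([round(z, 3) for z in z_scores])
print([round(to_mmhg(z), 1) for z in z_scores])
print(weight_per_mmhg)
```

Keeping `mu` and `sigma` alongside the model is one way, under these assumptions, to report an explanation in the original clinical units rather than in standard deviations.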

Paper Structure (11 sections, 7 figures)

This paper contains 11 sections and 7 figures.

Figures (7)

  • Figure 1: Several missing values in tabular data.
  • Figure 2: An outlier that deviates from the population.
  • Figure 3: Data augmentation approaches.
  • Figure 4: Visual representation of normalization and standardization methods.
  • Figure 5: Feature selection steps and the final set.
  • ...and 2 more figures