Interpreting Black-box Machine Learning Models for High Dimensional Datasets

Md. Rezaul Karim, Md. Shajalal, Alex Graß, Till Döhmen, Sisay Adugna Chala, Alexander Boden, Christian Beecks, Stefan Decker

TL;DR

High-dimensional datasets make many black-box models opaque and hard to interpret. The authors propose a two-stage approach: first learn latent embeddings with a black-box model (a convolutional autoencoder), then train an interpretable surrogate on a top-k feature subset to generate global rules and local explanations, including counterfactuals. They combine attention-based probing, sensitivity analysis, and surrogate modeling (DT/RF/XGBoost) to obtain faithful explanations, validated on four datasets with high predictive fidelity (R^2 ~0.86–0.94) and rule-based interpretability. The work targets scalable, actionable explanations for complex domains such as genomics and sensor data, which could improve transparency and accountability in high-stakes applications.
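
One plausible reading of the fidelity range quoted above is the coefficient of determination between the surrogate's outputs and the black-box's outputs on the same inputs. The sketch below assumes that reading; the function name `r2_fidelity` and the sample values are hypothetical and are not taken from the paper.

```python
def r2_fidelity(black_box_preds, surrogate_preds):
    """R^2 of surrogate outputs measured against black-box outputs.

    Assumption: fidelity is scored against the black-box predictions,
    not against ground-truth labels.
    """
    if len(black_box_preds) != len(surrogate_preds):
        raise ValueError("prediction lists must have the same length")
    n = len(black_box_preds)
    mean_bb = sum(black_box_preds) / n
    # Residual sum of squares: disagreement between surrogate and black-box.
    ss_res = sum((b - s) ** 2 for b, s in zip(black_box_preds, surrogate_preds))
    # Total sum of squares: spread of the black-box outputs themselves.
    ss_tot = sum((b - mean_bb) ** 2 for b in black_box_preds)
    if ss_tot == 0:
        raise ValueError("black-box outputs are constant; R^2 is undefined")
    return 1.0 - ss_res / ss_tot


# Hypothetical class-1 probabilities from a black-box and hard 0/1 surrogate outputs.
bb = [0.9, 0.1, 0.8, 0.2, 0.7]
sg = [1.0, 0.0, 1.0, 0.0, 1.0]
fidelity = r2_fidelity(bb, sg)
```

A value of 1.0 means the surrogate reproduces the black-box exactly; values well below 1.0 indicate that rules extracted from the surrogate may not reflect what the black-box actually does.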

Abstract

Deep neural networks (DNNs) have been shown to outperform traditional machine learning algorithms in a broad variety of application domains due to their effectiveness in modeling complex problems and handling high-dimensional datasets. Many real-life datasets, however, are of increasingly high dimensionality, where a large number of features may be irrelevant for both supervised and unsupervised learning tasks. Including such features not only introduces unwanted noise but also increases computational complexity. Furthermore, due to high non-linearity and dependency among a large number of features, DNN models tend to be unavoidably opaque and are perceived as black-box methods because their internal functioning is not well understood. Their algorithmic complexity often exceeds the human capacity to understand the interplay among myriad hyperparameters. A well-interpretable model can identify statistically significant features and explain the way they affect the model's outcome. In this paper, we propose an efficient method to improve the interpretability of black-box models for classification tasks on high-dimensional datasets. First, we train a black-box model on a high-dimensional dataset to learn the embeddings on which the classification is performed. To decompose the inner working principles of the black-box model and to identify the top-k important features, we employ different probing and perturbing techniques. We then approximate the behavior of the black-box model by means of an interpretable surrogate model on the top-k feature space. Finally, we derive decision rules and local explanations from the surrogate model to explain individual decisions. Our approach outperforms state-of-the-art methods like TabNet and XGBoost with respect to evaluation metrics and explainability when tested on different datasets with dimensionality varying between 50 and 20,000.
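
The pipeline described in the abstract (perturb to rank features, keep the top-k, fit an interpretable surrogate, read off a rule) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `black_box` is a hypothetical stand-in function rather than the paper's autoencoder-based classifier, the sensitivity score is a simple prediction-flip rate, and the surrogate is a single-threshold decision stump rather than the DT/RF/XGBoost surrogates the authors use.

```python
import random


def black_box(x):
    # Hypothetical opaque classifier: only features 3 and 7 matter.
    return 1 if 2.0 * x[3] - 1.5 * x[7] > 0.0 else 0


def sensitivity_scores(model, samples, n_features, rng):
    """Per feature: fraction of predictions that flip when it is resampled."""
    scores = []
    for j in range(n_features):
        flips = 0
        for x in samples:
            perturbed = list(x)
            perturbed[j] = rng.uniform(-1.0, 1.0)
            flips += int(model(perturbed) != model(x))
        scores.append(flips / len(samples))
    return scores


def fit_stump(samples, labels, feature_ids):
    """Best one-threshold rule on the given features.

    Returns (agreement, feature, threshold, class_if_above).
    """
    best = None
    for j in feature_ids:
        for t in sorted({x[j] for x in samples}):
            for above in (0, 1):
                preds = [above if x[j] > t else 1 - above for x in samples]
                acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
                if best is None or acc > best[0]:
                    best = (acc, j, t, above)
    return best


rng = random.Random(0)
n_features = 10
samples = [[rng.uniform(-1.0, 1.0) for _ in range(n_features)] for _ in range(200)]

# The surrogate is trained on the black-box's predictions, not on true labels.
labels = [black_box(x) for x in samples]

scores = sensitivity_scores(black_box, samples, n_features, rng)
top_k = sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:2]

agreement, feat, thr, above = fit_stump(samples, labels, top_k)
rule = f"IF x[{feat}] > {thr:.2f} THEN class {above} ELSE class {1 - above}"
```

Here `agreement` is the share of samples on which the stump matches the black-box, which plays the role of a fidelity score, and `rule` is the human-readable explanation. A deeper tree over the same top-k features would track the black-box more closely at the cost of longer rules.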

Paper Structure (16 sections, 11 equations, 8 figures, 3 tables, 1 algorithm)

This paper contains 16 sections, 11 equations, 8 figures, 3 tables, 1 algorithm.

Figures (8)

  • Figure 1: Workflow of our proposed approach (recreated based on Karim et al. karim_phd_thesis_2022)
  • Figure 2: Schematic representation of $SAN_{CAE}$ model (recreated based on Karim et al. karim_phd_thesis_2022)
  • Figure 3: Mean accuracy with respect to the relative dimension of the latent space across datasets. Shading indicates standard deviation. The baseline is obtained by training the $TabNet$ model on the original feature space (i.e., 100% of the dimensions)
  • Figure 4: Global feature impacts, sorted by impact
  • Figure 5: Decision boundaries for the XGBoost model across datasets for the top-2 features
  • ...and 3 more figures