BEExAI: Benchmark to Evaluate Explainable AI

Samuel Sithakoul; Sara Meftah; Clément Feutry

TL;DR

BEExAI introduces an open-source, end-to-end benchmark for evaluating post-hoc explainable AI methods on tabular data, addressing the lack of standardized quantitative metrics for explanations. The framework standardizes data preprocessing, model training, explanation generation, and metric-based evaluation across $9$ metrics and $8$ explainers, supporting regression, binary classification, and multi-label tasks on a diverse set of tabular datasets. Key contributions include a comprehensive, reproducible benchmarking pipeline, task-specific attribution handling, a curated set of interpretable metrics anchored to Faithfulness, Robustness, and Complexity, and extensive results showing how methods like SHAP, LIME, and Saliency perform differently across models and tasks. BEExAI enables practitioners to select explainers aligned with their task and model, while providing a scalable platform for future extensions to NLP/vision and human Plausibility studies, thereby advancing standardized evaluation in XAI.

Abstract

Recent research in explainability has given rise to numerous post-hoc attribution methods aimed at enhancing our comprehension of the outputs of black-box machine learning models. However, the evaluation of explanation quality lacks a cohesive approach, and there is no consensus on how to derive quantitative metrics that gauge the efficacy of post-hoc attribution methods. Furthermore, with the development of increasingly complex deep learning models for diverse data applications, a reliable way of measuring the quality and correctness of explanations is becoming critical. We address this by proposing BEExAI, a benchmark tool that allows large-scale comparison of different post-hoc XAI methods using a set of selected evaluation metrics.

Paper Structure (29 sections, 8 equations, 4 figures, 5 tables)

Introduction
BEExAI library
XAI evaluation
Choice of baseline
Experiments
Implementation details
Main External Libraries Utilized for BEExAI Implementation
Computational details
Sanity check
Datasets
Benchmarking procedure
Data:
ML Predictive Models
Predictive Models’ evaluation
XAI methods’ evaluation
...and 14 more sections

Figures (4)

Figure 1: Sufficiency (left) and Comprehensiveness (right) curves on the Diamonds dataset for a regression task. These curves represent the average values obtained from 1000 samples with an XGBoost model and the ShapleyValueSampling method. Sufficiency (Comprehensiveness) values refer to the absolute difference between the ML model’s predicted value when all features are used (removed) and the predicted values of the same model when the features are successively removed (added) in ascending order of their corresponding explainability attributions. Top: Sufficiency and Comprehensiveness values based on absolute values of attributions. Middle: Sufficiency and Comprehensiveness values based on raw attributions. Bottom: Sufficiency and Comprehensiveness for random attributions.
Figure 2: Radar plot of explainability metric values for the Diamonds dataset in a regression task using an NN model. Sensitivity metric values have been scaled up by a factor of 100.
Figure 3: Comparative performance of XAI methods: Frequency (in terms of percentage) of top-1 rankings across evaluation metrics for XGBoost (left) and NNs (right). XAI methods with missing bars indicate a top-1 ranking frequency close to 0%.
Figure 4: Comparative performance of XAI methods: Frequency (in terms of percentage) of a combined count of top-1 and top-2 rankings across evaluation metrics for XGBoost (left) and NNs (right).
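
The Figure 1 caption describes curves built by perturbing features one at a time in an order given by their attributions and recording the absolute change in the model's prediction. The sketch below illustrates that general idea on a toy linear model. It is not BEExAI's implementation: the function name `removal_curve`, the zero baseline, the descending removal order, and the toy model are all assumptions made for illustration, and the paper's exact definitions of Sufficiency and Comprehensiveness may differ.

```python
import numpy as np


def removal_curve(predict, x, attributions, baseline, descending=True):
    """Return |f(x) - f(x_masked)| after masking 1, 2, ..., d features.

    Hypothetical helper: features are replaced by `baseline` values one at a
    time, ordered by `attributions` (largest first when `descending` is True).
    """
    x = np.asarray(x, dtype=float)
    order = np.argsort(attributions)  # ascending order of attribution
    if descending:
        order = order[::-1]
    full_pred = predict(x[None, :])[0]  # prediction with all features present
    masked = x.copy()
    curve = []
    for idx in order:
        masked[idx] = baseline[idx]  # "remove" the feature (assumed baseline)
        curve.append(abs(full_pred - predict(masked[None, :])[0]))
    return np.array(curve)


# Toy linear model standing in for a trained regressor (assumption).
weights = np.array([3.0, -2.0, 0.5])


def predict(X):
    return X @ weights


x = np.array([1.0, 1.0, 1.0])
baseline = np.zeros(3)

# For a linear model, weight * (x - baseline) is an exact attribution.
attributions = weights * (x - baseline)

# Ranking by absolute attribution vs. raw (signed) attribution, mirroring
# the "Top" and "Middle" rows described in the Figure 1 caption.
curve_abs = removal_curve(predict, x, np.abs(attributions), baseline)
curve_raw = removal_curve(predict, x, attributions, baseline)
```

Once every feature has been masked, both curves end at the same value, the gap between the prediction on the full input and the prediction on the baseline. Only the path to that point depends on the ranking, which is why the caption separates absolute, raw, and random attributions.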
