BEExAI: Benchmark to Evaluate Explainable AI
Samuel Sithakoul, Sara Meftah, Clément Feutry
TL;DR
BEExAI is an open-source, end-to-end benchmark for evaluating post-hoc explainable AI (XAI) methods on tabular data, addressing the lack of standardized quantitative metrics for explanations. The framework standardizes data preprocessing, model training, explanation generation, and metric-based evaluation across 9 metrics and 8 explainers, and supports regression, binary classification, and multi-label tasks on a diverse set of tabular datasets. Key contributions include a comprehensive, reproducible benchmarking pipeline; task-specific attribution handling; a curated set of interpretable metrics anchored to Faithfulness, Robustness, and Complexity; and extensive results showing how methods such as SHAP, LIME, and Saliency perform differently across models and tasks. BEExAI helps practitioners select explainers suited to their task and model, and provides a scalable platform for future extensions to NLP and vision data and to human Plausibility studies, advancing standardized evaluation in XAI.
Abstract
Recent research in explainability has given rise to numerous post-hoc attribution methods aimed at improving our understanding of the outputs of black-box machine learning models. However, there is neither a cohesive approach to evaluating the quality of explanations nor a consensus on how to derive quantitative metrics that gauge the efficacy of post-hoc attribution methods. Furthermore, as increasingly complex deep learning models are developed for diverse data applications, a reliable way of measuring the quality and correctness of explanations is becoming critical. We address this by proposing BEExAI, a benchmark tool that enables large-scale comparison of different post-hoc XAI methods using a set of selected evaluation metrics.
