Navigating the Maze of Explainable AI: A Systematic Approach to Evaluating Methods and Metrics

Lukas Klein, Carsten T. Lüth, Udo Schlegel, Till J. Bungert, Mennatallah El-Assady, Paul F. Jäger

TL;DR

This work introduces LATEC, a large-scale benchmark for explainable AI (XAI) that jointly evaluates 17 attribution and attention methods across 20 metrics, yielding 7,560 design-parameter combinations, 326,790 saliency maps, and 378,000 metric scores. It develops a robust, aggregation-first evaluation scheme to mitigate metric-selection biases and to reveal reliable trends across both agreeing and disagreeing metrics. Key findings show that while Expected Gradients often ranks highly in faithfulness and robustness, no method dominates all criteria, and rankings generalize across datasets and architectures but can vary by modality; attention methods exhibit higher robustness yet larger metric disagreements. The authors release the full LATEC dataset and tooling to enable standardized, large-scale benchmarking of XAI methods and metrics, advancing both practical method selection and future research directions in XAI evaluation.

Abstract

Explainable AI (XAI) is a rapidly growing domain with a myriad of proposed methods as well as metrics aiming to evaluate their efficacy. However, current studies are often of limited scope, examining only a handful of XAI methods and ignoring underlying design parameters for performance, such as the model architecture or the nature of input data. Moreover, they often rely on one or a few metrics and neglect thorough validation, increasing the risk of selection bias and ignoring discrepancies among metrics. These shortcomings leave practitioners confused about which method to choose for their problem. In response, we introduce LATEC, a large-scale benchmark that critically evaluates 17 prominent XAI methods using 20 distinct metrics. We systematically incorporate vital design parameters like varied architectures and diverse input modalities, resulting in 7,560 examined combinations. Through LATEC, we showcase the high risk of conflicting metrics leading to unreliable rankings and consequently propose a more robust evaluation scheme. Further, we comprehensively evaluate various XAI methods to assist practitioners in selecting appropriate methods aligning with their needs. Curiously, the emerging top-performing method, Expected Gradients, is not examined in any relevant related study. LATEC reinforces its role in future XAI research by publicly releasing all 326k saliency maps and 378k metric scores as a (meta-)evaluation dataset. The benchmark is hosted at: https://github.com/IML-DKFZ/latec.

Paper Structure (47 sections, 6 equations, 14 figures, 13 tables)


Figures (14)

  • Figure 1: Structure of the LATEC framework including all design parameters and the output data of each stage provided as the LATEC dataset. Final rankings are analyzed in the benchmark.
  • Figure 2: a. Ranking of four XAI methods based on all evaluation metrics of each criterion for one specific set of design parameters. b. Average standard deviation per model architecture and dataset for the imaging modality. The weighted average per column is based on the number of metrics per criterion. c. Proportion of accepted one-sided Levene tests for a ranking variance significantly smaller than that of an entirely random ranking. Larger values show higher agreement between metrics. The weighted average is based on the number of metrics per criterion.
  • Figure 3: Illustrative saliency maps for all three modalities. The upper row shows three attribution methods and the lower row three attention-based methods. All XAI methods highlight the runway in the image and the vessel in the volume, but with different granularity and focus. For the point cloud of the plane, explanations are less understandable, with attribution methods highlighting single points at the front tip, rudder, or wing tips.
  • Figure 4: Example of KMeans clustering for point cloud data with k=16.
  • Figure 5: X-axis traversal of point clouds for the continuity metric. We cannot remove points, as this would change the input dimensionality; instead, we map them to the center (0,0,0), which is similar to black padding for image and volume data.
  • ...and 9 more figures
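
The masking idea in the Figure 5 caption can be illustrated with a minimal sketch. It is not the LATEC implementation: the function name `traverse_x`, the number of windows, and the equal-width windowing along the x-axis are assumptions made here for illustration. The only detail taken from the caption is that masked points are moved to the center (0,0,0) rather than deleted, so the input keeps its shape.

```python
import numpy as np


def traverse_x(points, n_steps):
    """Yield (masked_cloud, in_window) for successive windows along the x-axis.

    Hypothetical helper: `points` is an (N, 3) array. Points inside the
    current window are mapped to the origin instead of being removed, so
    every masked cloud still has shape (N, 3).
    """
    points = np.asarray(points, dtype=float)
    x = points[:, 0]
    # Equal-width window edges spanning the cloud's x-range (an assumption).
    edges = np.linspace(x.min(), x.max(), n_steps + 1)
    for i in range(n_steps):
        lo, hi = edges[i], edges[i + 1]
        # Half-open windows, except the last one, which includes the maximum.
        upper = (x <= hi) if i == n_steps - 1 else (x < hi)
        in_window = (x >= lo) & upper
        masked = points.copy()
        masked[in_window] = 0.0  # map to center (0, 0, 0), akin to black padding
        yield masked, in_window


# Usage: mask a random cloud in five steps and check the shape is preserved.
rng = np.random.default_rng(0)
cloud = rng.normal(size=(100, 3))
for masked, in_window in traverse_x(cloud, n_steps=5):
    assert masked.shape == cloud.shape
```

Keeping the array shape fixed is what lets the same model consume every masked variant without any change to its input layer.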