Saliency-Bench: A Comprehensive Benchmark for Evaluating Visual Explanations

Yifei Zhang, James Song, Siyi Gu, Tianxu Jiang, Bo Pan, Guangji Bai, Liang Zhao

TL;DR

Saliency-Bench addresses the fragmentation in evaluating visual explanations by introducing a standardized benchmark with eight diverse, annotated datasets and a unified evaluation pipeline that jointly measures alignment and faithfulness of saliency maps. It benchmarks multiple saliency methods, including GradCAM, GradCAM++, Integrated Gradients, InputXGradient, Occlusion, RISE, and ViT attention, across CNN and transformer architectures, revealing dataset- and model-dependent strengths and limitations. The work provides an easy-to-use API (xaibenchmark) to load data, generate explanations, and compute metrics, enabling reproducible comparisons and accelerating progress in XAI. By systematically analyzing both alignment (mIoU, Pointing Game) and faithfulness (iAUC) metrics, Saliency-Bench offers practical insights into how explanations correspond to ground-truth reasoning and model behavior, with implications for deploying trustworthy explanations in real-world tasks.
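
The two alignment metrics named above can be illustrated with a minimal NumPy sketch. It is not the xaibenchmark implementation: the function names, the min-max normalization, and the 0.5 binarization threshold are assumptions made for illustration, and the benchmark may binarize saliency maps differently.

```python
import numpy as np


def binarize(saliency, threshold=0.5):
    """Min-max normalize a saliency map to [0, 1], then threshold it.

    The 0.5 threshold is an illustrative assumption, not a benchmark setting.
    """
    s = np.asarray(saliency, dtype=float)
    span = s.max() - s.min()
    if span == 0:
        # A constant map highlights nothing in particular.
        return np.zeros(s.shape, dtype=bool)
    return (s - s.min()) / span >= threshold


def iou(saliency, gt_mask, threshold=0.5):
    """Intersection over union between the binarized map and the annotation."""
    pred = binarize(saliency, threshold)
    gt = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(pred, gt).sum() / union)


def mean_iou(saliency_maps, gt_masks, threshold=0.5):
    """mIoU: the per-image IoU averaged over a dataset."""
    return float(np.mean([iou(s, g, threshold)
                          for s, g in zip(saliency_maps, gt_masks)]))


def pointing_game_hit(saliency, gt_mask):
    """True if the single most salient pixel lies inside the annotation."""
    s = np.asarray(saliency, dtype=float)
    gt = np.asarray(gt_mask, dtype=bool)
    row, col = np.unravel_index(np.argmax(s), s.shape)
    return bool(gt[row, col])
```

Averaging `pointing_game_hit` over a dataset gives a Pointing Game accuracy. This sketch uses a strict hit test with no pixel tolerance around the maximum, which is a simplification.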

Abstract

Explainable AI (XAI) has gained significant attention for providing insights into the decision-making processes of deep learning models, particularly in image classification, where visual explanations are rendered as saliency maps. Despite this progress, evaluating such explanations remains difficult because annotated datasets and standardized evaluation pipelines are lacking. In this paper, we introduce Saliency-Bench, a novel benchmark suite designed to evaluate visual explanations generated by saliency methods across multiple datasets. We curated, constructed, and annotated eight datasets covering diverse tasks such as scene classification, cancer diagnosis, object classification, and action classification, each with corresponding ground-truth explanations. The benchmark includes a standardized, unified evaluation pipeline that assesses both the faithfulness and the alignment of visual explanations, providing a holistic assessment of explanation performance. We benchmark widely used saliency methods on these eight datasets across different image classifier architectures to evaluate explanation quality. Additionally, we developed an easy-to-use API that automates the evaluation pipeline, from data access and loading to result evaluation. The benchmark is available via our website: https://xaidataset.github.io.
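
To make the faithfulness side of the pipeline concrete, here is a minimal sketch of an insertion-style AUC. It assumes that iAUC denotes the insertion metric popularized by RISE (pixels are revealed from most to least salient and the model score is tracked), which the text here does not spell out. `predict_fn`, `steps`, and `baseline` are hypothetical names introduced for illustration and are not part of the xaibenchmark API.

```python
import numpy as np


def insertion_auc(image, saliency, predict_fn, steps=10, baseline=0.0):
    """Area under the score curve as salient pixels are inserted.

    image:      array of shape (H, W) or (H, W, C)
    saliency:   array of shape (H, W); higher means more important
    predict_fn: hypothetical callable mapping an image to a scalar class score
    """
    img = np.ascontiguousarray(image, dtype=float)
    sal = np.asarray(saliency, dtype=float)
    n = sal.size

    img_px = img.reshape(n, -1)                  # one row per pixel
    canvas = np.full(img.shape, baseline, dtype=float)
    canvas_px = canvas.reshape(n, -1)            # view: writes update canvas

    order = np.argsort(-sal.ravel(), kind="stable")   # most salient first
    bounds = np.linspace(0, n, steps + 1).astype(int)

    scores = [float(predict_fn(canvas))]         # score on the empty baseline
    for start, stop in zip(bounds[:-1], bounds[1:]):
        idx = order[start:stop]
        canvas_px[idx] = img_px[idx]             # reveal the next pixel batch
        scores.append(float(predict_fn(canvas)))

    # Trapezoidal rule over the fraction of pixels revealed.
    scores = np.asarray(scores)
    fractions = bounds / n
    return float(np.sum((scores[1:] + scores[:-1]) / 2 * np.diff(fractions)))
```

Under this definition, a saliency map that ranks the truly influential pixels first makes the score rise early and so yields a larger area than a map that ranks them last.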

Paper Structure (29 sections, 3 equations, 10 figures, 4 tables)

Figures (10)

  • Figure 1: Example images from the eight datasets—Gender-XAI, Environment-XAI, Disease-XAI, Cancer-XAI, Security-XAI, Pet-XAI, Action-XAI, and Object-XAI—across different tasks. Each image is paired with a ground-truth explanation annotation.
  • Figure 2: Overview of Saliency-Bench: A Comprehensive Benchmark for Evaluating Visual Explanations.
  • Figure 3: Examples of mIoU and Pointing Game evaluation, comparing saliency maps generated by GradCAM with ground-truth annotations on the Action-XAI dataset.
  • Figure 4: Qualitative results of visual explanation methods: (1) Original image; (2) Saliency map generated by GradCAM; (3) Saliency map generated by the attention mechanisms of ViT-B/16; (4) Saliency map generated by InputXGradient.
  • Figure 5: xaibenchmark Python package installation.
  • ...and 5 more figures