On The Coherence of Quantitative Evaluation of Visual Explanations

Benjamin Vandersmissen, Jose Oramas

TL;DR

The paper tackles the problem of incoherent quantitative evaluation of visual explanations for neural networks on real-world images. It conducts a comprehensive study on a subset of the ImageNet-1k validation set, comparing multiple explanation methods across standard model-centered metrics and performing sanity checks to assess metric reliability. The findings show that different evaluation metrics capture distinct, sometimes opposing notions of explanation quality: proxy tasks such as the Pointing Game correlate only weakly with the other metrics, and sparsity drives large score disparities. The work argues for cautious metric selection, highlights the impact of sparsity and coarseness on scores, and suggests binarization as a potential way to stabilize comparisons. It offers practical guidance for researchers and calls for more robust evaluation frameworks.
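To make the terms above concrete, here is a minimal NumPy sketch of a Pointing Game hit test and one possible binarization rule. It is an illustration under stated assumptions, not the paper's implementation: the function names, the top-fraction thresholding rule, and the toy arrays are all hypothetical, and the paper may define binarization differently.

```python
import numpy as np


def pointing_game_hit(saliency: np.ndarray, object_mask: np.ndarray) -> bool:
    """Return True if the single most salient pixel lies inside the object mask."""
    row, col = np.unravel_index(np.argmax(saliency), saliency.shape)
    return bool(object_mask[row, col])


def binarize_top_fraction(saliency: np.ndarray, fraction: float = 0.2) -> np.ndarray:
    """Set the top `fraction` of pixels to 1 and the rest to 0.

    Assumed rule for illustration only. Ties at the threshold are all kept,
    so slightly more than `fraction` of the pixels can end up set to 1.
    """
    k = max(1, int(round(fraction * saliency.size)))
    threshold = np.partition(saliency.ravel(), -k)[-k]  # k-th largest value
    return (saliency >= threshold).astype(np.float32)


# Toy 3x3 heatmap and a ground-truth mask covering the lower-right 2x2 block.
saliency = np.array([[0.0, 0.1, 0.0],
                     [0.0, 0.9, 0.2],
                     [0.0, 0.3, 0.0]])
object_mask = np.zeros((3, 3), dtype=bool)
object_mask[1:, 1:] = True

hit = pointing_game_hit(saliency, object_mask)
binary = binarize_top_fraction(saliency, fraction=1 / 3)
```

The hit test looks only at the location of the maximum, so it ignores how relevance is distributed over the rest of the image. Binarization, by contrast, discards magnitudes entirely, which is one way to compare sparse and coarse explanations on a more equal footing.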

Abstract

Recent years have shown an increased development of methods for justifying the predictions of neural networks through visual explanations. These explanations usually take the form of heatmaps which assign a saliency (or relevance) value to each pixel of the input image that expresses how relevant the pixel is for the prediction of a label. Complementing this development, evaluation methods have been proposed to assess the "goodness" of such explanations. On the one hand, some of these methods rely on synthetic datasets. However, this introduces the weakness of having limited guarantees regarding their applicability to more realistic settings. On the other hand, some methods rely on metrics for objective evaluation. However, how some of these evaluation methods perform relative to each other is uncertain. Taking this into account, we conduct a comprehensive study on a subset of the ImageNet-1k validation set where we evaluate a number of different commonly used explanation methods following a set of evaluation methods. We complement our study with sanity checks on the studied evaluation methods as a means to investigate their reliability and the impact of characteristics of the explanations on the evaluation methods. Results of our study suggest that there is a lack of coherence in the grading provided by some of the considered evaluation methods. Moreover, we have identified some characteristics of the explanations, e.g. sparsity, which can have a significant effect on the performance.
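As an illustration of a metric-based evaluation, the sketch below computes a deletion curve of the kind named in the Figure 2 caption: the most relevant pixels are replaced step by step and the model score is recorded after each step. It is a sketch under assumptions, not the authors' code. It assumes a single-channel image, a saliency map of the same shape, and a hypothetical `predict` callable that returns a scalar score. The toy "model" is a stand-in chosen so the numbers are easy to follow.

```python
import numpy as np


def deletion_curve(image, saliency, predict, steps=10, baseline=0.0):
    """Model score after progressively replacing the most relevant pixels.

    Assumes `image` and `saliency` are 2-D arrays of equal shape and that
    `predict` maps an image to a scalar score (hypothetical interface).
    """
    order = np.argsort(saliency.ravel())[::-1]  # most relevant pixels first
    perturbed = image.astype(np.float64).copy()
    flat = perturbed.reshape(-1)                # view onto `perturbed`
    scores = [predict(perturbed)]               # score of the intact image
    for chunk in np.array_split(order, steps):
        flat[chunk] = baseline                  # delete the next pixel group
        scores.append(predict(perturbed))
    return np.array(scores)


def area_under_curve(scores):
    """Trapezoidal area with the x-axis normalized to [0, 1]."""
    return float(np.sum((scores[1:] + scores[:-1]) / 2.0) / (len(scores) - 1))


# Toy setup: the "model score" is the fraction of total intensity remaining,
# and the explanation ranks pixels by intensity.
image = np.array([[1.0, 2.0],
                  [3.0, 4.0]])
saliency = image.copy()
predict = lambda x: float(x.sum() / 10.0)

scores = deletion_curve(image, saliency, predict, steps=4)
auc = area_under_curve(scores)
```

Under the usual convention for deletion, a steeper drop (smaller area) is read as a better explanation. An insertion curve can be sketched the same way by starting from the baseline image and restoring pixels in the same order.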

Paper Structure (27 sections, 2 equations, 11 figures, 7 tables)

Figures (11)

  • Figure 1: Visual explanations for different samples using the ResNet-50 network. Also pictured is the color scale used to visualize the explanations. From left to right: input image, adaSISE, Grad-CAM, Integrated Gradients, LRP, occlusion, RISE, smoothgrad, TAME.
  • Figure 2: The resulting scores when evaluating the considered explanation methods. Top: the insertion and deletion curves. Bottom: the other evaluation results in tabular form.
  • Figure 3: The kernel density estimation of the relevance assigned by different explanation methods.
  • Figure 4: Comparing blurred versions of IG and smoothgrad explanations to the other explanation methods. Top: the insertion and deletion curves. Bottom: the other evaluation results in tabular form (original values of IG and smoothgrad are provided between brackets).
  • Figure A1: The insertion curves for the ResNet-50 network. The curves on the left use pixel-level replacements, while the curves on the right use region-level replacements.
  • ...and 6 more figures