A Meaningful Perturbation Metric for Evaluating Explainability Methods

Danielle Cohen, Hila Chefer, Lior Wolf

TL;DR

This work addresses the challenge of evaluating attribution methods for deep vision models by exposing the failures of standard perturbation approaches that alter inputs in out-of-distribution ways. It introduces Stratified Inpainting, leveraging Stable Diffusion to replace high-relevance pixels with content conditioned on the second-highest class, and defines a weighting scheme to link inpainted changes to the original relevance maps via $f(C,E,I)$. Across CNNs and ViTs, and a broad set of attribution methods, the proposed metric yields rankings that align more closely with human preferences and better distinguish truly faithful explanations from random baselines, while reducing computation compared with full-inpainting baselines. The method offers a practical, scalable approach to standardizing explainability evaluation and enhancing interpretability in deep networks, with potential impact on model debugging, robustness, and user trust.

Abstract

Deep neural networks (DNNs) have demonstrated remarkable success, yet their wide adoption is often hindered by their opaque decision-making. To address this, attribution methods have been proposed to assign relevance values to each part of the input. However, different methods often produce entirely different relevance maps, necessitating the development of standardized metrics to evaluate them. Typically, such evaluation is performed through perturbation, wherein high- or low-relevance regions of the input image are manipulated to examine the change in prediction. In this work, we introduce a novel approach, which harnesses image generation models to perform targeted perturbation. Specifically, we focus on inpainting only the high-relevance pixels of an input image to modify the model's predictions while preserving image fidelity. This is in contrast to existing approaches, which often produce out-of-distribution modifications, leading to unreliable results. Through extensive experiments, we demonstrate the effectiveness of our approach in generating meaningful rankings across a wide range of models and attribution methods. Crucially, we establish that the ranking produced by our metric exhibits significantly higher correlation with human preferences compared to existing approaches, underscoring its potential for enhancing interpretability in DNNs.
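The targeted perturbation described above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the names `top_relevance_mask`, `flipped_to_runner_up`, and `perturb_and_check` are hypothetical, and so are the `fraction` parameter and the `classify` and `inpaint` callables, which stand in for the model under test and for a class-conditioned inpainting model. The success criterion (the prediction moves to the originally second-highest class) follows the TL;DR and the Figure 3 caption. The weighting $f(C,E,I)$ is not defined in this excerpt and is deliberately left out.

```python
import numpy as np

def top_relevance_mask(relevance, fraction):
    """Boolean mask marking the `fraction` most relevant pixels of a 2D map."""
    flat = relevance.ravel()
    k = max(1, int(round(fraction * flat.size)))
    # argpartition places the k largest values in the last k slots (unordered).
    top_idx = np.argpartition(flat, -k)[-k:]
    mask = np.zeros(flat.size, dtype=bool)
    mask[top_idx] = True
    return mask.reshape(relevance.shape)

def flipped_to_runner_up(probs_before, probs_after):
    """True if the new top-1 class is the original second-highest class."""
    runner_up = int(np.argsort(probs_before)[-2])
    return int(np.argmax(probs_after)) == runner_up

def perturb_and_check(image, relevance, fraction, classify, inpaint):
    """Inpaint only the high-relevance pixels, then test for the class flip.

    `classify(image)` returns class probabilities; `inpaint(image, mask, target)`
    returns a new image. Both are assumed, user-supplied callables.
    """
    mask = top_relevance_mask(relevance, fraction)
    probs_before = classify(image)
    target = int(np.argsort(probs_before)[-2])  # second-highest class
    perturbed = inpaint(image, mask, target)
    return flipped_to_runner_up(probs_before, classify(perturbed))
```

Passing the model and the inpainter in as callables keeps the sketch independent of any particular classifier or diffusion library; a relevance map counts as faithful under this simplified criterion if editing only its top pixels is enough to move the prediction to the runner-up class.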

Paper Structure

This paper contains 22 sections, 2 equations, 8 figures, 1 table, 1 algorithm.

Figures (8)

  • Figure 1: Examples of OOD predictions following perturbation with (a) pixel deletion (Hooker et al., 2019), (b) pixel blurring (Fong and Vedaldi, 2017), and (c) per-channel mean (Hooker et al., 2019) on both ResNet (He et al., 2015) and ViT (Dosovitskiy et al., 2020) backbones. For each image, we randomly draw pixels and apply the perturbation method. We show the original and modified predictions. As can be seen, even when the majority of the content is still visible, the model is sensitive to the OOD effect of the perturbation, causing it to modify its prediction.
  • Figure 2: Perturbation comparison against the leading baselines with ResNet-50, VGG, an AlexNet-based binary classifier, and ViT-B (please zoom in for a better view). For each model, we consider the most common explainability algorithms, in addition to a random selection of pixels (see the experimental setup section for details). As can be observed, the baselines often struggle to separate random maps from actual relevance maps (e.g., delete for all models, blur for AlexNet, mean for all CNN variants) and appear to produce very similar results for all methods. Conversely, our method produces a consistent ranking and a meaningful distinction from the random baseline.
  • Figure 3: Qualitative comparison for SmoothGrad. According to all baselines, SmoothGrad is the leading method for CNN-based networks. We demonstrate examples where the baseline metrics indicate success for SmoothGrad (i.e., the prediction changed) while ours indicates failure (i.e., the prediction did not change to the top-2 class). The relevance maps produced by SmoothGrad often lead to OOD effects with the baselines (similar to Fig. 1), so the reliability of the baselines for these results is questionable.
  • Figure 4: Qualitative comparison of successful class changes against baselines. We showcase image examples (Input) where both our method and the baselines induced a change in prediction (predictions indicated below each image). As seen, even when our method agrees with the baselines (i.e., the relevance map is faithful by all metrics), our method produces plausible pixel changes, while baselines cause OOD predictions.
  • Figure 5: User study evaluating the plausibility of our metric and baselines. We randomly select cases where the best relevance map differs for (a) ResNet and (b) ViT. Users select the most plausible map based on the input and prediction. The table shows the percentage favoring our method, with users overwhelmingly preferring it.
  • ...and 3 more figures
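
The three baseline perturbations named in the Figure 1 caption (pixel deletion, pixel blurring, per-channel mean) applied to randomly drawn pixels can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code used in the paper: all function names are hypothetical, images are assumed to be float arrays of shape (height, width, channels), deletion is assumed to mean setting pixels to zero, and a simple box blur stands in for whatever blur kernel the cited work uses.

```python
import numpy as np

def random_mask(height, width, fraction, seed=0):
    """Randomly mark roughly `fraction` of the pixels for perturbation."""
    rng = np.random.default_rng(seed)
    return rng.random((height, width)) < fraction

def delete_pixels(image, mask):
    """Pixel deletion: set the selected pixels to zero (assumed convention)."""
    out = image.astype(float).copy()
    out[mask] = 0.0
    return out

def mean_pixels(image, mask):
    """Per-channel mean: replace selected pixels with the image's channel means."""
    out = image.astype(float).copy()
    out[mask] = image.mean(axis=(0, 1))  # shape (channels,), broadcast per pixel
    return out

def box_blur_pixels(image, mask, radius=1):
    """Blurring: replace selected pixels with a box-blurred copy (stand-in kernel)."""
    size = 2 * radius + 1
    height, width = image.shape[:2]
    padded = np.pad(image.astype(float),
                    ((radius, radius), (radius, radius), (0, 0)), mode="edge")
    blurred = np.zeros(image.shape, dtype=float)
    # Sum every shifted window, then divide to get the local average.
    for dy in range(size):
        for dx in range(size):
            blurred += padded[dy:dy + height, dx:dx + width]
    blurred /= size * size
    out = image.astype(float).copy()
    out[mask] = blurred[mask]
    return out
```

Feeding the outputs of these functions to a classifier and comparing predictions before and after is the kind of check the caption describes; none of the three draws replacement pixels from the natural-image distribution, which is the OOD concern the caption raises.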