Decoupling Pixel Flipping and Occlusion Strategy for Consistent XAI Benchmarks

Stefan Blücher, Johanna Vielhaben, Nils Strodthoff

TL;DR

This work addresses the instability of pixel-flipping (PF) benchmarks for XAI caused by varied occlusion strategies. It introduces the Reference-out-of-model-scope (R-OMS) score to quantify occlusion reliability and the symmetric relevance gain (SRG) to combine most- and least-influential feature rankings, yielding consistent method rankings across diverse PF setups. The findings show diffusion imputers provide reliable occluded samples but at high cost, while SRG enables trustworthy benchmarking with cheaper imputers and reduces dependence on occlusion strategy. Overall, the approach improves comparability and reproducibility in XAI benchmarking, guiding practitioners toward robust, strategy-agnostic evaluations of attribution methods.

Abstract

Feature removal is a central building block of eXplainable AI (XAI), both for occlusion-based explanations (Shapley values) and for their evaluation (pixel flipping, PF). However, occlusion strategies vary significantly, from simple mean replacement to inpainting with state-of-the-art diffusion models. This ambiguity limits the usefulness of occlusion-based approaches. For example, PF benchmarks lead to contradictory rankings. This is amplified by competing PF measures: features are removed starting with either the most influential first (MIF) or the least influential first (LIF). This study proposes two complementary perspectives to resolve this disagreement problem. First, we address the common criticism of occlusion-based XAI that artificial samples lead to unreliable model evaluations. We propose to measure this reliability with the R(eference)-Out-of-Model-Scope (OMS) score. The R-OMS score enables a systematic comparison of occlusion strategies and resolves the disagreement problem by grouping consistent PF rankings. Second, we show that the insightfulness of MIF and LIF is conversely dependent on the R-OMS score. To leverage this, we combine the MIF and LIF measures into the symmetric relevance gain (SRG) measure. This breaks the inherent connection to the underlying occlusion strategy and leads to consistent rankings. This resolves the disagreement problem, which we verify for a set of 40 different occlusion strategies.
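
The pixel-flipping procedure described in the abstract can be illustrated with a minimal sketch. The abstract does not give the SRG formula, so the code below assumes that SRG is the area under the LIF curve minus the area under the MIF curve. The toy linear model, the constant-value occlusion (standing in for the simplest imputer), and all function names are hypothetical and chosen for illustration only; this is not the authors' implementation.

```python
from typing import Callable, List, Sequence


def pf_curve(model: Callable[[Sequence[float]], float],
             x: Sequence[float],
             order: Sequence[int],
             baseline: float = 0.0) -> List[float]:
    """Model outputs while features are occluded one by one in `order`."""
    occluded = list(x)
    outputs = [model(occluded)]
    for idx in order:
        # Assumption: constant replacement as the occlusion strategy.
        occluded[idx] = baseline
        outputs.append(model(occluded))
    return outputs


def area(curve: Sequence[float]) -> float:
    """Mean of the curve, used here as a simple area-under-curve proxy."""
    return sum(curve) / len(curve)


def srg(model: Callable[[Sequence[float]], float],
        x: Sequence[float],
        attribution: Sequence[float],
        baseline: float = 0.0) -> float:
    """Assumed SRG: LIF area minus MIF area (larger is better)."""
    # MIF: occlude the most influential features first.
    mif_order = sorted(range(len(x)), key=lambda i: attribution[i], reverse=True)
    # LIF: occlude the least influential features first.
    lif_order = mif_order[::-1]
    mif = area(pf_curve(model, x, mif_order, baseline))
    lif = area(pf_curve(model, x, lif_order, baseline))
    return lif - mif


# Hypothetical toy setup: a linear model whose exact attributions are w_i * x_i.
weights = [3.0, 1.0, 0.5, 0.0]


def linear_model(features: Sequence[float]) -> float:
    return sum(w * f for w, f in zip(weights, features))


x = [1.0, 1.0, 1.0, 1.0]
good_attribution = [w * f for w, f in zip(weights, x)]  # faithful ranking
bad_attribution = good_attribution[::-1]                # inverted ranking

print("SRG (faithful):", srg(linear_model, x, good_attribution))
print("SRG (inverted):", srg(linear_model, x, bad_attribution))
```

In this toy setting the faithful attribution obtains a positive score and the inverted one the same magnitude with a negative sign, which reflects the symmetry between the two removal orders under the assumed definition.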

Paper Structure

This paper contains 24 sections, 5 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: Pixel flipping benchmarks of XAI methods. Both MIF and LIF are affected by the random baseline. Using the complete symmetric relevance gain (SRG) introduced in \ref{imputation:subsec:scg_metric} breaks the inherent dependence on the occlusion strategy.
  • Figure 2: R-OMS vs. NR-OMS.
  • Figure 3: (A) Occlusion strategy and model interact. (B, C, D) Variation of each design choice for a fixed standard ResNet50: (B) granularity of segmentation, (C) shape of segmentation, (D) imputer choice. (Summary) reports the average variation (interquartile ranges) associated with each design choice.
  • Figure 4: PF benchmarks based on varying occlusion strategies lead to many disagreeing rankings for both MIF and LIF. Sorting rankings based on the $\overline{\text{R-OMS}}$ groups consistent rankings. The lower panel visualizes the disagreement problem as the deviation from the most frequent ranking (reference). The consistency of MIF (high) and LIF (low to medium) are complementary when sorting based on the $\overline{\text{R-OMS}}$.
  • Figure 5: Consistency of the SRG measure, independent of the occlusion strategy. Left panel: theoretically achievable improvement over the random baseline. Right panel: SRG rankings of XAI methods (legend in \ref{imputation:fig:ranking}).
  • ...and 4 more figures