Causal Attribution via Activation Patching

Amirmohammad Izadi, Mohammadali Banayeeanzade, Alireza Mirrokni, Hosein Hasani, Mobin Bagherian, Faridoun Mehri, Mahdieh Soleymani Baghshah

Abstract

Attribution methods for Vision Transformers (ViTs) aim to identify image regions that influence model predictions, but producing faithful and well-localized attributions remains challenging. Existing gradient-based and perturbation-based techniques often fail to isolate the causal contribution of internal representations associated with individual image patches. The key challenge is that class-relevant evidence is formed through interactions between patch tokens across layers, and input-level perturbations can be poor proxies for patch importance, since they may fail to reconstruct the internal evidence actually used by the model. We propose Causal Attribution via Activation Patching (CAAP), which estimates the contribution of individual image patches to the ViT's prediction by directly intervening on internal activations rather than using learned masks or synthetic perturbation patterns. For each patch, CAAP inserts the corresponding source-image activations into a neutral target context over an intermediate range of layers and uses the resulting target-class score as the attribution signal. The resulting attribution map reflects the causal effect of patch-associated internal representations on the model's prediction. The causal intervention serves as a principled measure of patch influence by capturing class-relevant evidence after initial representation formation, while avoiding late-layer global mixing that can reduce spatial specificity. Across multiple ViT backbones and standard metrics, CAAP significantly outperforms existing methods and produces more faithful and localized attributions.
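The patching procedure described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the authors' implementation: the toy attention stack (`make_toy_vit`, `layer_forward`) stands in for a real ViT, the all-zero `target` stands in for the neutral context, the patched layer range `range(2, 5)` is arbitrary, and the intervention overwrites the residual stream at each layer's output for a single token.

```python
import numpy as np


def make_toy_vit(num_layers=6, dim=16, num_classes=5, seed=0):
    """Random single-head attention stack; a hypothetical stand-in for a real ViT."""
    rng = np.random.default_rng(seed)
    scale = dim ** -0.5
    layers = [
        {name: rng.normal(scale=scale, size=(dim, dim)) for name in ("q", "k", "v")}
        for _ in range(num_layers)
    ]
    head = rng.normal(scale=scale, size=(dim, num_classes))
    return layers, head


def layer_forward(x, w):
    """One residual self-attention block over tokens x of shape (1 + num_patches, dim)."""
    scores = (x @ w["q"]) @ (x @ w["k"]).T / np.sqrt(x.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerically stable softmax
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=1, keepdims=True)
    return x + attn @ (x @ w["v"])


def forward(tokens, layers, head, patch_token=None, patch_layers=(), source_cache=None):
    """Run the stack; optionally overwrite one token with cached source activations.

    Returns the class logits read from the CLS token (index 0) and the
    per-layer output activations.
    """
    x = tokens
    cache = []
    for idx, w in enumerate(layers):
        x = layer_forward(x, w)  # fresh array, so the in-place write below is safe
        if patch_token is not None and idx in patch_layers:
            x[patch_token] = source_cache[idx][patch_token]
        cache.append(x.copy())
    return x[0] @ head, cache


def caap_attribution(source, target, layers, head, class_idx, patch_layers):
    """One patched forward pass per patch; the target-class score is the attribution."""
    _, source_cache = forward(source, layers, head)
    num_patches = source.shape[0] - 1  # token 0 is CLS
    scores = np.empty(num_patches)
    for p in range(num_patches):
        logits, _ = forward(
            target, layers, head,
            patch_token=p + 1, patch_layers=patch_layers, source_cache=source_cache,
        )
        scores[p] = logits[class_idx]
    return scores


# Toy usage: CLS token plus a 3x3 grid of patch tokens.
layers, head = make_toy_vit()
rng = np.random.default_rng(1)
source = rng.normal(size=(1 + 9, 16))
target = np.zeros((1 + 9, 16))  # hypothetical "blank" context
attribution = caap_attribution(
    source, target, layers, head, class_idx=2, patch_layers=range(2, 5)
)
heatmap = attribution.reshape(3, 3)
```

In this sketch the cost is one forward pass per patch. Two sanity checks follow from the construction: patching over an empty layer range reproduces the blank-context score for every patch, and using the source itself as the target reproduces the unpatched source score.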

Paper Structure (39 sections, 12 equations, 40 figures, 15 tables, 1 algorithm)

Figures (40)

  • Figure 1: Overview of CAAP. Given a source image (top) and a blank target image (middle), we extract internal activations from the source image corresponding to a selected patch and a specified range of layers. These activations are injected into the target context to form a patched context (bottom), while all other activations remain unchanged. A forward pass on the patched sequence produces a class score for the CLS token (green), which measures the causal effect of the selected patch on the target prediction. Repeating this intervention for all patches yields a patch-level attribution.
  • Figure 2: Qualitative attribution comparison between different methods for a representative ImageNet sample. CAAP produces more compact and well-localized attributions for the target class than the baselines. More examples are provided in Appendix \ref{app:more_qualitative}.
  • Figure 3: Qualitative attribution comparison in a representative image containing several objects using CLIP-L/14. CAAP produces compact and class-specific attributions aligned with the queried class. More examples are provided in Appendix \ref{app:more_qualitative}.
  • Figure 4: Target blank ablation on ImageNet across four ViT backbones. The type of target blank patches is varied (Black, Blur, Noisy Mean, Noisy, White), and faithfulness (Insertion, Deletion, Ins$-$Del) and localization (PG, AUPR$_1$, AUPR$_0$) are reported.
  • Figure 5: Selection operator ablation on ImageNet across four ViT backbones. The spatial support of the selection operator is varied by considering different neighborhood variants (No Padding, Radius-1 Box, Radius-2 Box, Radius-1 Manhattan), and faithfulness (Insertion, Deletion, Ins$-$Del) and localization (PG, AUPR$_1$, AUPR$_0$) are reported.
  • ...and 35 more figures
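
The neighborhood variants named in the Figure 5 caption can be pictured on the patch grid with a short sketch. The definitions here are assumptions, since the caption does not spell them out: "Radius-r Box" is read as all patches within Chebyshev distance r, "Radius-1 Manhattan" as the center plus its four axis-aligned neighbors, and "No Padding" as the center patch alone. The function name `neighborhood` and its parameters are hypothetical.

```python
def neighborhood(row, col, grid_h, grid_w, radius=0, metric="box"):
    """List the (row, col) patch coordinates within `radius` of a center patch.

    metric="box" uses Chebyshev distance (a square window);
    metric="manhattan" uses L1 distance (a diamond).
    Coordinates outside the grid are dropped.
    """
    if metric not in ("box", "manhattan"):
        raise ValueError(f"unknown metric: {metric!r}")
    cells = []
    for r in range(max(0, row - radius), min(grid_h, row + radius + 1)):
        for c in range(max(0, col - radius), min(grid_w, col + radius + 1)):
            dr, dc = abs(r - row), abs(c - col)
            dist = max(dr, dc) if metric == "box" else dr + dc
            if dist <= radius:
                cells.append((r, c))
    return cells


# Supports on a 14x14 patch grid (the grid size is an illustrative choice).
no_padding = neighborhood(5, 5, 14, 14, radius=0)
box_1 = neighborhood(5, 5, 14, 14, radius=1, metric="box")
box_2 = neighborhood(5, 5, 14, 14, radius=2, metric="box")
manhattan_1 = neighborhood(5, 5, 14, 14, radius=1, metric="manhattan")
```

Under these assumed definitions, an interior patch has supports of 1, 9, 25, and 5 patches respectively, and supports shrink at the image border because out-of-grid coordinates are dropped.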