Improving Interpretation Faithfulness for Vision Transformers

Lijie Hu, Yixin Liu, Ninghao Liu, Mengdi Huai, Lichao Sun, Di Wang

TL;DR

This work addresses the fragility of attention-based explanations in Vision Transformers by introducing Faithful Vision Transformers (FViTs) and a constructive method called Denoised Diffusion Smoothing (DDS). DDS combines randomized smoothing with diffusion-based denoising to produce attention and predictions that remain stable under input perturbations, effectively turning ViTs into FViTs. The authors give a formal definition of faithfulness based on stable top-$k$ attention and robust prediction distributions, prove conditions under which DDS yields a faithful attention module, and show that Gaussian noise is nearly optimal for both the $\ell_2$ and $\ell_\infty$ norms. Experiments on ImageNet, Cityscapes, and COCO across ViT, DeiT, and Swin models show improved interpretation faithfulness and robustness to adversarial perturbations at a favorable computational cost, suggesting practical applicability to large vision-language models (LVLMs) and safety-focused AI systems.

Abstract

Vision Transformers (ViTs) have achieved state-of-the-art performance for various vision tasks. One reason behind the success lies in their ability to provide plausible innate explanations for the behavior of neural architectures. However, ViTs suffer from issues with explanation faithfulness, as their focal points are fragile to adversarial attacks and can be easily changed with even slight perturbations on the input image. In this paper, we propose a rigorous approach to mitigate these issues by introducing Faithful ViTs (FViTs). Briefly speaking, an FViT should have the following two properties: (1) The top-$k$ indices of its self-attention vector should remain mostly unchanged under input perturbation, indicating stable explanations; (2) The prediction distribution should be robust to perturbations. To achieve this, we propose a new method called Denoised Diffusion Smoothing (DDS), which adopts randomized smoothing and diffusion-based denoising. We theoretically prove that processing ViTs directly with DDS can turn them into FViTs. We also show that Gaussian noise is nearly optimal for both $\ell_2$ and $\ell_\infty$-norm cases. Finally, we demonstrate the effectiveness of our approach through comprehensive experiments and evaluations. Results show that FViTs are more robust against adversarial attacks while maintaining the explainability of attention, indicating higher faithfulness.
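The two properties above can be made concrete with a small sketch. The code below is an illustration under stated assumptions, not the authors' implementation: `toy_attention` and `toy_denoiser` are hypothetical stand-ins for a ViT attention vector and a diffusion denoiser, and `smoothed_attention` uses a plain Monte Carlo average over Gaussian-noised inputs as one simple reading of "randomized smoothing plus denoising". `topk_overlap` measures how much of the top-$k$ index set survives a perturbation.

```python
import numpy as np


def topk_overlap(attn_a, attn_b, k):
    """Fraction of the top-k indices of attn_a that are also top-k in attn_b."""
    top_a = set(np.argsort(attn_a)[-k:].tolist())
    top_b = set(np.argsort(attn_b)[-k:].tolist())
    return len(top_a & top_b) / k


def smoothed_attention(attn_fn, denoise_fn, x, sigma, n_samples, rng):
    """Average attn_fn over n_samples Gaussian-noised, then denoised, copies of x."""
    samples = [
        attn_fn(denoise_fn(x + sigma * rng.standard_normal(x.shape)))
        for _ in range(n_samples)
    ]
    return np.mean(samples, axis=0)


def toy_attention(x):
    """Hypothetical stand-in for a ViT attention vector: softmax over the inputs."""
    z = np.exp(x - x.max())
    return z / z.sum()


def toy_denoiser(x):
    """Hypothetical stand-in for a diffusion denoiser: clip to the valid range."""
    return np.clip(x, 0.0, 1.0)


rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 16)               # toy "image" with 16 tokens
x_adv = x + 0.05 * rng.standard_normal(16)  # slightly perturbed copy

# Top-k stability of the raw attention versus the smoothed attention.
plain = topk_overlap(toy_attention(x), toy_attention(x_adv), k=4)
smooth = topk_overlap(
    smoothed_attention(toy_attention, toy_denoiser, x, 0.1, 200, rng),
    smoothed_attention(toy_attention, toy_denoiser, x_adv, 0.1, 200, rng),
    k=4,
)
print(f"top-4 overlap, plain: {plain:.2f}; smoothed: {smooth:.2f}")
```

In a real setting `attn_fn` would wrap a trained ViT and `denoise_fn` a pretrained diffusion model; the noise level `sigma` and sample count `n_samples` are free parameters chosen here only for illustration.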

Paper Structure

This paper contains 31 sections, 15 theorems, 33 equations, 10 figures, 10 tables, 2 algorithms.

Key Result

Theorem 4.4

If a function is a $(R, D_\alpha, \gamma, \beta, k, \|\cdot\|)$-faithful attention module for ViTs, then if [...] we have, for all $x'$ such that $\|x - x'\| \leq R$, [...], where $\mathcal{G}$ is the set of classes, $p_{(1)}$ and $p_{(2)}$ are the largest and second-largest probabilities in $\{ p_i \}$, and $p_i$ is the probability that $\bar{y}(x)$ returns the $i$-th class.
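
The displayed formulas of Theorem 4.4 are missing from the statement above. For orientation only, the block below recalls the Rényi-divergence lemma of Li et al. (2019), which the paper lists among its lemmas. The quantities $p_{(1)}$ and $p_{(2)}$ in the theorem suggest that its condition on $\gamma$ has this shape, but that is an assumption: the text above does not confirm it. Here $P = (p_1, \dots, p_K)$ and $Q = (q_1, \dots, q_K)$ are two distributions over the same $K$ classes.

```latex
% Lemma (Li et al., 2019), recalled as background; not taken from the text above.
% If the indices of the largest entries of P and Q differ, then
\[
  D_\alpha(Q \,\|\, P) \;\ge\;
  -\log\!\Bigl( 1 - p_{(1)} - p_{(2)}
    + 2 \bigl( \tfrac{1}{2} \bigl( p_{(1)}^{\,1-\alpha} + p_{(2)}^{\,1-\alpha} \bigr) \bigr)^{\frac{1}{1-\alpha}} \Bigr).
\]
% Contrapositive: if D_alpha(Q || P) is strictly below the right-hand side,
% then P and Q share the same most likely class.
% Sanity check: when p_(1) = p_(2) the right-hand side equals -log(1) = 0,
% so a zero margin between the top two classes certifies nothing.
```

Read this way, a bound $\gamma$ on the Rényi divergence between the prediction distributions at $x$ and $x'$ that stays below the right-hand side would leave the predicted class unchanged.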

Figures (10)

  • Figure 1: Visualization results of the attention map on corrupted input for different methods.
  • Figure 2: Class-specific explanation heat map visualizations under adversarial corruption. For each image, we present results for two different classes. Other baselines either give inconsistent interpretations or focus on the wrong class regions under adversarial perturbations, whereas our method gives a consistent interpretation map and is robust against adversarial attacks.
  • Figure 3: (a) and (b) are results of the perturbation test. (c) is the sensitivity analysis results.
  • Figure 4: Class-specific explanation heat map visualizations under adversarial corruption. For each image, we present results for two different classes. Other baselines either give inconsistent interpretations or focus on the wrong class regions under adversarial perturbations, whereas our method gives a consistent interpretation map and is robust against adversarial attacks.
  • Figure 5: Visualization results of the attention map on corrupted input for different methods.
  • ...and 5 more figures

Theorems & Definitions (26)

  • Definition 4.1
  • Definition 4.2: Faithful ViTs
  • Definition 4.3
  • Theorem 4.4
  • Theorem 5.1
  • Theorem 5.2
  • Theorem 5.3
  • Theorem 5.4
  • proof
  • Lemma 2.1: Rényi Divergence Lemma (Li et al., 2019)
  • ...and 16 more