Mechanistic Understandings of Representation Vulnerabilities and Engineering Robust Vision Transformers

Chashi Mahiul Islam, Samuel Jacob Chacko, Mao Nishino, Xiuwen Liu

TL;DR

The paper investigates how Vision Transformers exhibit representation vulnerabilities under adversarial perturbations and introduces NeuroShield-ViT, a defense that selectively neutralizes vulnerable neurons in early layers without adversarial training. Through three mechanistic analyses—global CLS token trajectories, spatial token perturbation dynamics, and neuron activation sensitivity—it shows that perturbations are subtle in early layers but amplify toward middle-to-late layers, driving misclassification. NeuroShield-ViT identifies adversarially important neurons via activations and gradients, applying a neutralization coefficient to attenuate their influence, achieving strong robustness against iterative attacks and zero-shot generalization with limited calibration data. These findings offer a path toward robust ViT designs and guidance for layer- and neuron-level defenses in Transformer-based vision systems.

Abstract

While transformer-based models dominate NLP and vision applications, their underlying mechanisms to map the input space to the label space semantically are not well understood. In this paper, we study the sources of known representation vulnerabilities of vision transformers (ViT), where perceptually identical images can have very different representations and semantically unrelated images can have the same representation. Our analysis indicates that imperceptible changes to the input can result in significant representation changes, particularly in later layers, suggesting potential instabilities in the performance of ViTs. Our comprehensive study reveals that adversarial effects, while subtle in early layers, propagate and amplify through the network, becoming most pronounced in middle to late layers. This insight motivates the development of NeuroShield-ViT, a novel defense mechanism that strategically neutralizes vulnerable neurons in earlier layers to prevent the cascade of adversarial effects. We demonstrate NeuroShield-ViT's effectiveness across various attacks, particularly excelling against strong iterative attacks, and showcase its remarkable zero-shot generalization capabilities. Without fine-tuning, our method achieves a competitive accuracy of 77.8% on adversarial examples, surpassing conventional robustness methods. Our results shed new light on how adversarial effects propagate through ViT layers, while providing a promising approach to enhance the robustness of vision transformers against adversarial attacks.
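The neuron-level defense described above (score neurons using activations and gradients, then attenuate the most important ones with a neutralization coefficient) can be illustrated with a minimal sketch. This is not the paper's algorithm: the scoring rule `|activation * gradient|`, the top-fraction selection, and the names `importance_scores`, `top_fraction_indices`, `neutralize`, `fraction`, and `coeff` are all assumptions made for illustration, and the toy numbers are invented.

```python
import math


def importance_scores(activations, gradients):
    # ASSUMPTION: importance is |activation * gradient| per neuron.
    # The paper's exact scoring rule is not given in this summary.
    return [abs(a * g) for a, g in zip(activations, gradients)]


def top_fraction_indices(scores, fraction):
    # Select the indices of the highest-scoring neurons (at least one).
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, math.ceil(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def neutralize(activations, gradients, fraction=0.25, coeff=0.0):
    # ASSUMPTION: "neutralizing" means scaling the selected activations
    # by coeff (0.0 removes them entirely, 1.0 leaves them unchanged).
    if len(activations) != len(gradients):
        raise ValueError("activations and gradients must have equal length")
    selected = top_fraction_indices(
        importance_scores(activations, gradients), fraction
    )
    return [a * coeff if i in selected else a for i, a in enumerate(activations)]


# Toy example: four neurons in one hypothetical early-layer representation.
activations = [0.5, -2.0, 1.0, 0.1]
gradients = [0.1, 1.5, -0.2, 0.05]
neutralized = neutralize(activations, gradients, fraction=0.25, coeff=0.1)
print(neutralized)
```

In this toy case only the second neuron has a large score, so it alone is scaled by `coeff` while the other activations pass through unchanged.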

Paper Structure

This paper contains 22 sections, 3 equations, 5 figures, 4 tables, 1 algorithm.

Figures (5)

  • Figure 1: Average layer-wise cosine similarity of CLS embeddings for input-perturbed, target-perturbed, and input-target image pairs across the ViT-B layers.
  • Figure 2: Heatmaps illustrating the distribution of cosine similarities for image token embeddings in attention projection (attn.proj), MLP feed-forward (mlp.fc2), and residual connection (add_1) blocks across ViT-B layers.
  • Figure 3: Top-p% adversarial neurons per representation responsible for adversarial effects across layers.
  • Figure 4: Impact of layer-wise neutralization on model accuracy.
  • Figure 5: Impact of the subset size (%) of NCS on the model accuracy.
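The layer-wise cosine similarity reported in Figure 1 can be sketched as follows. This is a minimal illustration that assumes the CLS embedding of each layer has already been extracted as a plain list of floats; the function names and the toy vectors are hypothetical and do not come from the paper.

```python
import math


def cosine_similarity(u, v):
    # Standard cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def layerwise_cls_similarity(cls_a, cls_b):
    # cls_a[i] and cls_b[i] are the CLS embeddings of two images at layer i.
    if len(cls_a) != len(cls_b):
        raise ValueError("both inputs must cover the same number of layers")
    return [cosine_similarity(a, b) for a, b in zip(cls_a, cls_b)]


# Toy data (invented): three layers, two-dimensional CLS embeddings.
# The embeddings agree at the first layer and drift apart in later layers.
clean = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
perturbed = [[1.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
sims = layerwise_cls_similarity(clean, perturbed)
print(sims)
```

A similarity that stays near 1 in early layers and falls in later ones would correspond to the pattern the abstract describes for an input and its perturbed version: subtle early differences that amplify with depth.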