Protego: Detecting Adversarial Examples for Vision Transformers via Intrinsic Capabilities

Jialin Wu; Kaikai Pan; Yanjiao Chen; Jiangyi Deng; Shengyuan Pang; Wenyuan Xu

TL;DR

This work addresses the adversarial vulnerability of Vision Transformers (ViTs) by introducing Protego, a plug-in detector that exploits the transformer's intrinsic capabilities to detect adversarial inputs without modifying the ViT backbone. Protego extracts high-level features from the transformer layers and uses a lightweight, one-layer detector to separate clean from adversarial representations, achieving AUC > $0.95$ across six attack types on ImageNet with three pre-trained ViTs. Attention Rollout and Gradient Attention Rollout are used to interpret how adversarial inputs shift attention patterns, and the detector is trained with SGDM and cross-entropy loss to distinguish adversarial from normal samples. The approach outperforms baselines such as LID and feature squeezing, with practical implications for metaverse security, where resilient visual perception is critical. Overall, Protego offers a practical, plug-in defense for ViTs against a range of white-box and black-box attacks, with potential for extension to cross-modal and larger-scale multimodal models.
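To make the detector idea concrete, here is a minimal sketch of a one-layer (linear) detector trained with SGD with momentum and cross-entropy loss. It is an illustration under stated assumptions, not the authors' implementation: the features are synthetic Gaussian vectors standing in for ViT features, and the names `train_detector` and `predict`, the feature dimension, and all hyperparameters are hypothetical.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the class axis.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_detector(feats, labels, lr=0.1, momentum=0.9, epochs=20,
                   batch_size=32, seed=0):
    """Train a single linear layer (2 classes) with mini-batch SGD + momentum."""
    rng = np.random.default_rng(seed)
    n, d = feats.shape
    W, b = np.zeros((d, 2)), np.zeros(2)
    vW, vb = np.zeros_like(W), np.zeros_like(b)
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            x, y = feats[idx], np.eye(2)[labels[idx]]
            # Gradient of mean cross-entropy w.r.t. the logits.
            grad = (softmax(x @ W + b) - y) / len(idx)
            vW = momentum * vW - lr * (x.T @ grad)
            vb = momentum * vb - lr * grad.sum(axis=0)
            W, b = W + vW, b + vb
    return W, b

def predict(feats, W, b):
    # 0 = normal example, 1 = adversarial example.
    return softmax(feats @ W + b).argmax(axis=1)

# Synthetic stand-ins for extracted features (assumption, not real ViT features).
rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, size=(200, 16))
adv = rng.normal(1.5, 1.0, size=(200, 16))
feats = np.vstack([clean, adv])
labels = np.array([0] * 200 + [1] * 200)

W, b = train_detector(feats, labels)
acc = (predict(feats, W, b) == labels).mean()
print(f"training accuracy on synthetic features: {acc:.3f}")
```

In practice the synthetic arrays would be replaced by features taken from a frozen ViT for normal and adversarial inputs; the backbone itself stays unchanged, which is what makes the detector a plug-in.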

Abstract

Transformer models have excelled in natural language tasks, prompting the vision community to explore their use in computer vision problems. However, these models remain vulnerable to adversarial examples. In this paper, we investigate the attack capabilities of six common adversarial attacks on three pretrained ViT models to reveal the vulnerability of ViT models. To understand and analyse the bias in neural network decisions when the input is adversarial, we use two visualisation techniques: attention rollout and grad attention rollout. To protect ViT models from adversarial attacks, we propose Protego, a detection framework that leverages the transformer's intrinsic capabilities to detect adversarial examples for ViT models. This is challenging because adversaries may adopt a wide variety of attack strategies. Our design rests on two observations. First, owing to the attention mechanism, the token used for prediction contains information from the entire input sample. Second, the attention region for adversarial examples differs from that of normal examples. Based on these observations, we train a detector that identifies adversarial examples and outperforms existing detection methods. Our experiments demonstrate the high effectiveness of our detection method: for all six adversarial attack methods, our detector's AUC scores exceed 0.95. Protego may advance investigations in metaverse security.
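As an illustration of the first visualisation technique, the sketch below follows the standard formulation of attention rollout: average the attention heads in each layer, mix in the identity matrix to account for residual connections, renormalise the rows, and multiply the per-layer matrices together. It is a sketch under assumptions rather than the paper's code: the attention maps are random row-stochastic matrices, and the layer, head, and token counts are made up for the example.

```python
import numpy as np

def attention_rollout(attentions):
    """Roll out attention across layers.

    attentions: list of arrays with shape (heads, tokens, tokens),
    one per layer, ordered from the first layer to the last.
    """
    n = attentions[0].shape[-1]
    rollout = np.eye(n)
    for attn in attentions:
        a = attn.mean(axis=0)                 # fuse heads by averaging
        a = 0.5 * a + 0.5 * np.eye(n)         # model the residual connection
        a = a / a.sum(axis=1, keepdims=True)  # keep rows normalised
        rollout = a @ rollout                 # compose with earlier layers
    return rollout

# Hypothetical sizes: 4 layers, 3 heads, 1 class token + 9 patch tokens.
rng = np.random.default_rng(0)
layers = []
for _ in range(4):
    raw = rng.random((3, 10, 10))
    layers.append(raw / raw.sum(axis=-1, keepdims=True))

rollout = attention_rollout(layers)
# Attention flowing from the class token (index 0) to each patch token.
cls_to_patches = rollout[0, 1:].reshape(3, 3)
print(cls_to_patches.round(3))
```

Reshaping the class-token row to the patch grid gives a coarse saliency map; comparing such maps for a normal input and its adversarial counterpart is one way to inspect the shift in attention region that the abstract describes.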

Paper Structure (26 sections, 14 equations, 6 figures, 4 tables)

Introduction
Background & Related works
Vision Transformer
Adversarial attack
White-Box Attacks
Black-Box Attacks
Defense
Detection
Model enhancement
Design
Features Extracting
Architecture of Plugin-Detector
Vision Transformer Interpretability
Training & Loss
Evaluation
...and 11 more sections

Figures (6)

Figure 1: Security issues in the computer vision domain within the metaverse. Protego is a charm (our detector) that protects the caster (the model) with an invisible shield that reflects spells and blocks physical entities (adversarial examples).
Figure 2: The architecture of Vision Transformer, and the attention mechanism in transformer encoder.
Figure 3: Framework of Protego. The detector is trained on the features in the transformer and can be inserted into the layer that extracts the features of the transformer block.
Figure 4: Attack effectiveness for three ViT models. The $\epsilon$ values are $1/256, 2/256, 4/256, 8/256, 16/256, 32/256$.
Figure 5: The results obtained using the attention rollout method and the grad attention rollout method on normal examples.
...and 1 more figures
