SpecFormer: Guarding Vision Transformer Robustness via Maximum Singular Value Penalization

Xixu Hu, Runkai Zheng, Jindong Wang, Cheuk Hang Leung, Qi Wu, Xing Xie

TL;DR

This work targets adversarial robustness of Vision Transformers by deriving local Lipschitz bounds for the self-attention mechanism and proposing Maximum Singular Value Penalization (MSVP) to bound these Lipschitz constants. SpecFormer integrates MSVP into attention layers and uses power iteration to efficiently estimate spectral norms, improving stability without sacrificing training efficiency. Extensive experiments on CIFAR-10/100, Imagenette, and ImageNet across multiple ViT variants show state-of-the-art robustness under standard and adversarial training, with notable gains against FGSM, PGD, CW, and AutoAttack while preserving or improving clean accuracy. The approach provides a theoretically grounded, simple, and broadly applicable method for enhancing ViT robustness in practical settings, with code available at the provided repository.

Abstract

Vision Transformers (ViTs) are increasingly used in computer vision due to their high performance, but their vulnerability to adversarial attacks is a concern. Existing methods lack a solid theoretical basis, focusing mainly on empirical training adjustments. This study introduces SpecFormer, tailored to fortify ViTs against adversarial attacks, with theoretical underpinnings. We establish local Lipschitz bounds for the self-attention layer and propose the Maximum Singular Value Penalization (MSVP) to precisely manage these bounds. By incorporating MSVP into ViTs' attention layers, we enhance the model's robustness without compromising training efficiency. SpecFormer, the resulting model, outperforms other state-of-the-art models in defending against adversarial attacks, as shown by experiments on CIFAR and ImageNet datasets. Code is released at https://github.com/microsoft/robustlearn.
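
To make the idea of penalizing maximum singular values concrete, here is a minimal sketch in plain Python. It is not the released implementation: the function names `spectral_norm` and `msvp_penalty`, the coefficient `lam`, the layer dictionary layout, and the choice to sum the estimates over the query, key, and value matrices are all assumptions made for illustration.

```python
import math
import random


def spectral_norm(W, iters=50, seed=0):
    """Estimate sigma_max of W (a list of rows) by power iteration on W^T W."""
    rows, cols = len(W), len(W[0])
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(cols)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]  # start from a random unit vector
    for _ in range(iters):
        u = [sum(W[i][j] * v[j] for j in range(cols)) for i in range(rows)]  # u = W v
        v = [sum(W[i][j] * u[i] for i in range(rows)) for j in range(cols)]  # v = W^T u
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:  # W v vanished, e.g. W is the zero matrix
            return 0.0
        v = [x / norm for x in v]
    u = [sum(W[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return math.sqrt(sum(x * x for x in u))  # ||W v|| with ||v|| = 1


def msvp_penalty(layers, lam):
    """Assumed penalty form: lam times the sum of sigma_max over all projection matrices."""
    return lam * sum(spectral_norm(W) for layer in layers for W in layer.values())


# Hypothetical single attention layer with small, hand-picked projection matrices.
layers = [{
    "W_q": [[3.0, 0.0], [0.0, 1.0]],  # sigma_max = 3
    "W_k": [[0.0, 2.0], [1.0, 0.0]],  # sigma_max = 2
    "W_v": [[1.0, 0.0], [0.0, 1.0]],  # sigma_max = 1
}]
task_loss = 0.25  # placeholder for the classification loss
penalty = msvp_penalty(layers, lam=0.1)
total_loss = task_loss + penalty
```

With these toy matrices the penalty comes out at roughly 0.1 × (3 + 2 + 1) = 0.6. In a real training loop the estimate would have to be differentiable with respect to the weights, which this stdlib sketch does not attempt.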

Paper Structure (23 sections, 5 theorems, 37 equations, 2 figures, 8 tables, 1 algorithm)

This paper contains 23 sections, 5 theorems, 37 equations, 2 figures, 8 tables, 1 algorithm.

Key Result

Theorem

Let $f: \mathcal{X} \rightarrow \mathbb{R}^m$ be differentiable and locally Lipschitz continuous under a choice of $p$-norm $\|\cdot\|_p$. Let $\mathbf{J}_f(\mathbf{x})$ denote its total derivative (Jacobian) at $\mathbf{x}$. Then $\mathrm{Lip}_p(f, \mathcal{X}) = \sup_{\mathbf{x} \in \mathcal{X}} \|\mathbf{J}_f(\mathbf{x})\|_p$, where $\|\mathbf{J}_f(\mathbf{x})\|_p$ is the induced operator norm on $\mathbf{J}_f(\mathbf{x})$.
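
A small numerical check can illustrate the relationship between the local Lipschitz constant and the supremum of the Jacobian's induced norm. The sketch below is an illustration only, under assumptions not taken from the paper: the function $f(\mathbf{x}) = \tanh(W\mathbf{x})$, the matrix `W`, the region $\mathcal{X} = [-1, 1]^2$, and the choice $p = \infty$ (where the induced norm is the maximum absolute row sum) are all hypothetical.

```python
import math
import random

W = [[1.0, -2.0], [0.5, 0.25]]  # hypothetical weight matrix


def f(x):
    """f(x) = tanh(W x), applied elementwise; differentiable everywhere."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]


def jacobian_inf_norm(x):
    """Induced infinity-norm of J_f(x) = diag(1 - tanh^2(W x)) W (max absolute row sum)."""
    y = f(x)
    return max((1.0 - yi * yi) * sum(abs(w) for w in row) for yi, row in zip(y, W))


def inf_norm(v):
    return max(abs(t) for t in v)


def ratio(a, b):
    """Difference quotient ||f(a) - f(b)|| / ||a - b|| in the infinity-norm."""
    fa, fb = f(a), f(b)
    return inf_norm([p - q for p, q in zip(fa, fb)]) / inf_norm([p - q for p, q in zip(a, b)])


# The factor 1 - tanh^2 is at most 1 and equals 1 at the origin,
# so the supremum of the Jacobian norm over X is attained at x = 0.
sup_jacobian = jacobian_inf_norm([0.0, 0.0])

# Random pairs in X = [-1, 1]^2 never exceed the bound ...
rng = random.Random(0)
points = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(2000)]
max_ratio = max(ratio(a, b) for a, b in zip(points[::2], points[1::2]))

# ... while a pair close to the origin, along the worst-case direction, nearly attains it.
t = 1e-3
near_ratio = ratio([t, -t], [-t, t])
```

Here `sup_jacobian` equals 3 (the largest absolute row sum of `W`), every sampled difference quotient stays below it, and `near_ratio` approaches it as `t` shrinks.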

Figures (2)

  • Figure 1: SpecFormer with MSVP.
  • Figure 2: The analysis of MSVP. (a) Maximum singular value comparison between MSVP and the vanilla Transformer. (b) and (c) t-SNE (van der Maaten and Hinton, 2008) feature visualization.

Theorems & Definitions (9)

  • Definition: Local Lipschitz Continuity
  • Definition: Local Lipschitz Constant
  • Theorem: Calculation of Local Lipschitz Constant (Federer, 1969)
  • Proposition: Model Sensitivity and Maximum Singular Value in Linear Models (Yoshida and Miyato, 2017)
  • Theorem
  • Proof
  • Lemma: Relationship between the spectral norm of a block row and the block matrix (Kim et al., 2021)
  • Theorem: Convergence guarantee of the power iteration method (von Mises and Pollaczek-Geiringer, 1929)
  • Proof
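
The convergence behaviour of power iteration can be observed on a matrix whose largest singular value is known in closed form. The following sketch is illustrative and makes its own assumptions: the matrix `W`, the start vector, and the function name `sigma_estimate` are hypothetical, and the geometric-rate remark is the textbook statement rather than the paper's exact theorem. For $W = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}$, the eigenvalues of $W^\top W$ are $(3 \pm \sqrt{5})/2$, so $\sigma_{\max}(W) = (1 + \sqrt{5})/2$.

```python
import math


def sigma_estimate(W, v, iters):
    """Run `iters` normalized steps v <- W^T W v and return the estimate ||W v||."""
    rows, cols = len(W), len(W[0])
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        u = [sum(W[i][j] * v[j] for j in range(cols)) for i in range(rows)]  # u = W v
        v = [sum(W[i][j] * u[i] for i in range(rows)) for j in range(cols)]  # v = W^T u
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    u = [sum(W[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return math.sqrt(sum(x * x for x in u))


W = [[1.0, 1.0], [0.0, 1.0]]
sigma_true = (1.0 + math.sqrt(5.0)) / 2.0  # closed-form largest singular value of W

# Absolute error of the estimate after 1, 2, 4, and 8 iterations from the same start vector.
errors = [abs(sigma_estimate(W, [1.0, 0.0], k) - sigma_true) for k in (1, 2, 4, 8)]
```

The errors shrink geometrically because the two eigenvalues of $W^\top W$ are well separated; when the top two singular values are close, more iterations are needed for the same accuracy.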