SpecFormer: Guarding Vision Transformer Robustness via Maximum Singular Value Penalization
Xixu Hu, Runkai Zheng, Jindong Wang, Cheuk Hang Leung, Qi Wu, Xing Xie
TL;DR
This work targets adversarial robustness of Vision Transformers by deriving local Lipschitz bounds for the self-attention mechanism and proposing Maximum Singular Value Penalization (MSVP) to bound these Lipschitz constants. SpecFormer integrates MSVP into attention layers and uses power iteration to efficiently estimate spectral norms, improving stability without sacrificing training efficiency. Extensive experiments on CIFAR-10/100, Imagenette, and ImageNet across multiple ViT variants show state-of-the-art robustness under standard and adversarial training, with notable gains against FGSM, PGD, CW, and AutoAttack while preserving or improving clean accuracy. The approach provides a theoretically grounded, simple, and broadly applicable method for enhancing ViT robustness in practical settings, with code available at the provided repository.
Abstract
Vision Transformers (ViTs) are increasingly used in computer vision due to their high performance, but their vulnerability to adversarial attacks is a concern. Existing methods lack a solid theoretical basis, focusing mainly on empirical training adjustments. This study introduces SpecFormer, tailored to fortify ViTs against adversarial attacks, with theoretical underpinnings. We establish local Lipschitz bounds for the self-attention layer and propose the Maximum Singular Value Penalization (MSVP) to precisely manage these bounds. By incorporating MSVP into ViTs' attention layers, we enhance the model's robustness without compromising training efficiency. SpecFormer, the resulting model, outperforms other state-of-the-art models in defending against adversarial attacks, as demonstrated by experiments on CIFAR and ImageNet datasets. Code is released at https://github.com/microsoft/robustlearn.
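To make the idea of penalizing maximum singular values concrete, the following is a minimal sketch, not the authors' implementation. It estimates the largest singular value of a weight matrix by power iteration and sums the estimates into a penalty term. The function names (`max_singular_value`, `msvp_penalty`), the coefficient `coeff`, the deterministic start vector, and the choice to sum unsquared singular values are all illustrative assumptions; the exact penalty form used in the paper may differ.

```python
import math


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def transpose(W):
    """Return the transpose of W as a list of rows."""
    return [list(col) for col in zip(*W)]


def normalize(v, eps=1e-12):
    """Scale v to (approximately) unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]


def max_singular_value(W, num_iters=50):
    """Estimate the largest singular value of W by power iteration.

    Alternating multiplications by W and W^T drive v toward the top
    right singular vector. The all-ones start vector is an assumption
    made for reproducibility; a random start is the usual choice,
    since a start orthogonal to the top singular vector would stall.
    """
    Wt = transpose(W)
    v = normalize([1.0] * len(W[0]))
    for _ in range(num_iters):
        u = normalize(matvec(W, v))
        v = normalize(matvec(Wt, u))
    # With v close to the top right singular vector, ||W v|| is close to sigma_max.
    Wv = matvec(W, v)
    return math.sqrt(sum(x * x for x in Wv))


def msvp_penalty(weights, coeff=1e-3):
    """Hypothetical penalty: coeff times the sum of estimated spectral norms."""
    return coeff * sum(max_singular_value(W) for W in weights)


# Example: stand-ins for two attention projection matrices.
W_q = [[3.0, 0.0], [0.0, 1.0]]
W_k = [[0.0, 2.0], [1.0, 0.0]]
penalty = msvp_penalty([W_q, W_k], coeff=0.1)
```

In a training loop, a term of this kind would be added to the classification loss so that gradient descent discourages large spectral norms in the attention projections. In practice a framework with automatic differentiation would be used, typically running only a few power-iteration steps per update.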
