LipShiFT: A Certifiably Robust Shift-based Vision Transformer
Rohan Menon, Nicola Franco, Stephan Günnemann
TL;DR
This work tackles certifiable robustness for transformer-based vision models by deriving tight Lipschitz bounds and training with a margin-based loss. It introduces LipShiFT, a Lipschitz-continuous Vision Transformer based on ShiftViT that incorporates CenterNorm, MaxMin activation, LiResConv, fixed pooling, orthogonal initialization, DropPath/Dropout, and LLN, and is trained with the EMMA margin loss and certified at an $l_2$-norm radius of $\epsilon=36/255$. The authors provide upper-bound estimates of the model's Lipschitz constant and show that the approach scales to larger architectures, achieving competitive certified robustness on CIFAR-10/100 and Tiny ImageNet, alongside solid empirical robustness under AutoAttack. The results suggest a viable route toward scalable, Lipschitz-certified vision transformers.
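To make the certification step concrete, the following is a minimal sketch of how a Lipschitz bound is commonly turned into a certified $l_2$ radius. It is not taken from the paper: it assumes the logit vector is $L$-Lipschitz in the $l_2$ norm, so any pairwise logit difference is $\sqrt{2}L$-Lipschitz, and it ignores any adjustment LLN may require. The function names `certified_radius` and `is_certified` are hypothetical.

```python
import math


def certified_radius(logits, lipschitz_const):
    """Return an l2 radius within which the top-1 prediction cannot change.

    Assumption (not from the paper): the whole logit vector is
    `lipschitz_const`-Lipschitz in the l2 norm, so the difference of any
    two logits is sqrt(2) * lipschitz_const-Lipschitz.
    """
    if len(logits) < 2:
        raise ValueError("need at least two classes")
    if lipschitz_const <= 0:
        raise ValueError("Lipschitz constant must be positive")
    # Margin between the predicted class and the strongest competitor.
    top, runner_up = sorted(logits, reverse=True)[:2]
    return (top - runner_up) / (math.sqrt(2) * lipschitz_const)


def is_certified(logits, lipschitz_const, eps=36 / 255):
    """True if the prediction is provably stable for perturbations up to eps."""
    return certified_radius(logits, lipschitz_const) > eps


if __name__ == "__main__":
    # A margin of 2.0 with L = 1 certifies well beyond eps = 36/255.
    print(is_certified([2.0, 0.0, -1.0], lipschitz_const=1.0))
    # A margin of 0.1 with L = 1 is too small to certify at eps = 36/255.
    print(is_certified([0.1, 0.0], lipschitz_const=1.0))
```

Under these assumptions, certified accuracy at $\epsilon=36/255$ is the fraction of test points that are both correctly classified and pass `is_certified`, which is why margin-based training and a small Lipschitz constant work together.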
Abstract
Deriving tight Lipschitz bounds for transformer-based architectures presents a significant challenge. The large input sizes and high-dimensional attention modules typically prove to be crucial bottlenecks during the training process and lead to sub-optimal results. Our research highlights practical constraints of these methods in vision tasks. We find that Lipschitz-based margin training acts as a strong regularizer while restricting weights in successive layers of the model. Focusing on a Lipschitz continuous variant of the ShiftViT model, we address significant training challenges for transformer-based architectures under a norm-constrained input setting. We provide an upper bound estimate for the Lipschitz constants of this model using the $l_2$ norm on common image classification datasets. Ultimately, we demonstrate that our method scales to larger models and advances the state-of-the-art in certified robustness for transformer-based architectures.
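As an illustration of what an upper-bound estimate of a Lipschitz constant can look like, the sketch below uses the standard product-of-spectral-norms bound for a plain sequential stack of linear layers with 1-Lipschitz activations. This is a generic technique, not the paper's procedure: it ignores residual connections and the specific LipShiFT layers, and the helper names `spectral_norm` and `lipschitz_upper_bound` are hypothetical. Power iteration approaches the spectral norm from below, so in practice a converged estimate or an exact decomposition is needed for a sound bound.

```python
import math
import random


def spectral_norm(matrix, iters=200, seed=0):
    """Estimate the largest singular value of `matrix` (a list of rows)
    by power iteration on A^T A, starting from a seeded random vector."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(n_cols)]
    norm_v = math.sqrt(sum(x * x for x in v))
    v = [x / norm_v for x in v]
    sigma = 0.0
    for _ in range(iters):
        # u = A v, and ||u|| is the current estimate because v has unit norm.
        u = [sum(a * x for a, x in zip(row, v)) for row in matrix]
        sigma = math.sqrt(sum(x * x for x in u))
        # w = A^T u, then renormalize to obtain the next iterate.
        w = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm_w = math.sqrt(sum(x * x for x in w))
        if norm_w == 0.0:
            # Happens for the zero matrix (or a degenerate starting vector).
            return sigma
        v = [x / norm_w for x in w]
    return sigma


def lipschitz_upper_bound(weight_matrices):
    """Product of per-layer spectral norms: an l2 Lipschitz bound for a
    sequential composition of linear maps with 1-Lipschitz activations."""
    bound = 1.0
    for w in weight_matrices:
        bound *= spectral_norm(w)
    return bound


if __name__ == "__main__":
    layers = [
        [[3.0, 0.0], [0.0, 1.0]],  # spectral norm 3
        [[0.5, 0.0], [0.0, 2.0]],  # spectral norm 2
    ]
    print(lipschitz_upper_bound(layers))
```

The product bound is often loose, since it assumes every layer stretches its input along the worst direction simultaneously. That looseness is one reason architectures in this line of work constrain each layer to be close to 1-Lipschitz rather than bounding an unconstrained network after training.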
