LipShiFT: A Certifiably Robust Shift-based Vision Transformer

Rohan Menon, Nicola Franco, Stephan Günnemann

TL;DR

This work tackles the challenge of certifiable robustness for transformer-based vision models by deriving tight Lipschitz bounds and integrating margin-based training. It introduces LipShiFT, a Lipschitz-continuous ShiftViT-based Vision Transformer that incorporates CenterNorm, MaxMin activation, LiResConv, fixed pooling, orthogonal initialization, DropPath/Dropout, and LLN, optimized with the EMMA margin loss for a certified $l_2$-norm radius of $\epsilon=36/255$. The authors provide upper-bound estimates of the model Lipschitz constant and demonstrate scalability to larger architectures, achieving competitive certified robustness across CIFAR-10/100 and Tiny ImageNet, alongside solid empirical robustness under AutoAttack. This work advances the practical deployment of certifiably robust vision transformers and suggests a viable route toward scalable, Lipschitz-certified models with real-world applicability.
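Of the components listed above, MaxMin is the simplest to illustrate in isolation. The following is a minimal sketch, assuming the common formulation in which features are grouped into adjacent pairs and each pair is replaced by its (max, min); the pairing scheme and the function name `maxmin` are illustrative assumptions, not taken from the authors' code. Because each pair is only reordered, the output has the same $l_2$ norm as the input, which is why this activation suits Lipschitz-constrained networks.

```python
import math


def maxmin(features):
    """Sort each adjacent pair of features in descending order.

    Assumption: features are paired as (0, 1), (2, 3), ...; other
    implementations may pair the first half with the second half.
    """
    if len(features) % 2 != 0:
        raise ValueError("MaxMin needs an even number of features")
    out = []
    for a, b in zip(features[0::2], features[1::2]):
        out.extend((max(a, b), min(a, b)))
    return out


def l2_norm(values):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values))


x = [1.0, 3.0, -2.0, -5.0]
y = maxmin(x)
print(y)                        # [3.0, 1.0, -2.0, -5.0]
print(l2_norm(x), l2_norm(y))   # the two norms are equal
```

Unlike ReLU, this operation discards no information in either element of a pair, so gradient norms are preserved through the activation.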

Abstract

Deriving tight Lipschitz bounds for transformer-based architectures presents a significant challenge. The large input sizes and high-dimensional attention modules typically prove to be crucial bottlenecks during the training process and lead to sub-optimal results. Our research highlights practical constraints of these methods in vision tasks. We find that Lipschitz-based margin training acts as a strong regularizer while restricting weights in successive layers of the model. Focusing on a Lipschitz-continuous variant of the ShiftViT model, we address significant training challenges for transformer-based architectures under a norm-constrained input setting. We provide an upper-bound estimate for the Lipschitz constant of this model using the $l_2$ norm on common image classification datasets. Ultimately, we demonstrate that our method scales to larger models and advances the state of the art in certified robustness for transformer-based architectures.
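To make the link between a Lipschitz upper bound and a certified radius concrete, here is a minimal sketch of the generic margin-based certificate. It assumes a single global $l_2$ Lipschitz bound `lipschitz_bound` on the logit map; the difference of any two logits is then $\sqrt{2}$ times that bound, so the prediction cannot change within radius margin / ($\sqrt{2}$ · bound). The function names are hypothetical, and the paper's own procedure (for example, its use of the EMMA loss) may differ in detail.

```python
import math


def certified_radius(logits, lipschitz_bound):
    """Lower bound on the l2 radius within which the top-1 class is fixed.

    Assumption: `lipschitz_bound` upper-bounds the l2 Lipschitz constant
    of the whole network (input to logits).
    """
    if lipschitz_bound <= 0:
        raise ValueError("lipschitz_bound must be positive")
    if len(logits) < 2:
        raise ValueError("need at least two classes")
    ranked = sorted(logits, reverse=True)
    margin = ranked[0] - ranked[1]  # gap between top-1 and runner-up
    return margin / (math.sqrt(2.0) * lipschitz_bound)


def is_certified(logits, lipschitz_bound, eps=36 / 255):
    """True if the prediction is provably stable for perturbations up to eps."""
    return certified_radius(logits, lipschitz_bound) > eps


print(is_certified([1.0, 0.0, -0.5], lipschitz_bound=1.0))  # True
print(is_certified([0.1, 0.0, -0.5], lipschitz_bound=1.0))  # False
```

Verified accuracy at $\epsilon=36/255$ would then be the fraction of test inputs that are both correctly classified and pass this check. A looser Lipschitz bound shrinks the certified radius directly, which is why tight bounds matter.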

Paper Structure

This paper contains 28 sections, 5 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Proposed modified shift block.
  • Figure 2: Effect of model size on verified and clean accuracy.
  • Figure 3: Effect of dropout rate on accuracy.
  • Figure 4: Effect of learning rate on accuracy.
  • Figure 5: Effect of batch size on accuracy.