Approximate Nullspace Augmented Finetuning for Robust Vision Transformers
Haoyang Liu, Aditya Singh, Yijiang Li, Haohan Wang
TL;DR
The paper investigates intrinsic robustness of Vision Transformers through the lens of nullspaces, starting from a nontrivial nullspace in the patch embedding layer and extending to a generalized, approximate nullspace for nonlinear encoder blocks. It introduces a practical nullspace noise augmentation strategy: synthesize $\epsilon$-approximate nullspace perturbations $\tilde{\mathbf{v}}$ and fine-tune the model to be tolerant to such noise, effectively enlarging the approximate nullspace. Empirically, this nullspace-noise augmented finetuning yields significant robustness improvements against adversarial attacks and distribution shifts on ImageNet-1k with ViT variants, often outperforming or matching adversarial training baselines while maintaining performance on clean data. The work highlights a new, intrinsic robustness mechanism in ViTs, demonstrates a scalable augmentation technique, and suggests future directions for exploiting nullspace-inspired invariances in deep networks.
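The exact-nullspace starting point can be illustrated with a small NumPy sketch. It is not taken from the paper's code: the sizes (a 16x16 RGB patch flattened to 768 values and embedded into 192 dimensions) and the random weights `W` and `b` are assumptions standing in for a learned patch embedding. Whenever the embedding dimension is smaller than the flattened patch dimension, the linear map must have a non-trivial nullspace, and adding any nullspace element to a patch leaves the embedding unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 16x16 RGB patch (768 values) embedded into 192 dimensions.
patch_dim, embed_dim = 16 * 16 * 3, 192
W = rng.standard_normal((embed_dim, patch_dim))  # stand-in for learned weights
b = rng.standard_normal(embed_dim)               # stand-in for a learned bias


def embed(patch):
    """Linear patch embedding: an affine map from pixel space to token space."""
    return W @ patch + b


# With full_matrices=True, the rows of Vt beyond rank(W) span the nullspace of W.
_, s, Vt = np.linalg.svd(W, full_matrices=True)
rank = int(np.sum(s > 1e-10 * s[0]))
null_basis = Vt[rank:]  # shape: (patch_dim - rank, patch_dim)

# Any linear combination of the basis vectors is a nullspace element.
coeffs = rng.standard_normal(null_basis.shape[0])
v = coeffs @ null_basis

# Adding v to a patch changes the embedding only by floating-point error.
x = rng.standard_normal(patch_dim)
max_change = float(np.max(np.abs(embed(x + v) - embed(x))))
print("nullspace dimension:", null_basis.shape[0])
print("noise norm:", float(np.linalg.norm(v)))
print("largest change in embedding:", max_change)
```

The bias plays no role in the invariance, since it cancels in the difference `embed(x + v) - embed(x)`; only the weight matrix determines the nullspace.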
Abstract
Enhancing the robustness of deep learning models, particularly vision transformers (ViTs), is crucial for their real-world deployment. In this work, we propose a finetuning approach that improves the robustness of vision transformers, inspired by the concept of nullspace from linear algebra. Our investigation centers on whether a vision transformer can exhibit resilience to input variations akin to the nullspace property of linear mappings, under which perturbations sampled from the nullspace do not change the model's output when added to the input. We start from the observation that many existing ViTs satisfy this property because their patch embedding layer has a non-trivial nullspace. Then, we extend the notion of nullspace to nonlinear settings and demonstrate that it is possible to synthesize approximate nullspace elements for ViT's encoder blocks through optimization. Finally, we propose a finetuning strategy for ViTs wherein we augment the training data with synthesized approximate nullspace noise. We find that our finetuning approach significantly improves the models' robustness to both adversarial and natural image perturbations.\footnote{Code is available at: https://github.com/Liu-Hy/ns-vit.}
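The idea of synthesizing approximate nullspace elements through optimization can be sketched on a toy nonlinear model. Everything below is an assumption made for illustration rather than the authors' procedure: the two-layer tanh network stands in for an encoder, the loss is the mean squared output change over a batch, and the noise is kept on a sphere of fixed norm so that the trivial solution (zero noise) is excluded. The step-acceptance rule is a simple safeguard that keeps the loss from increasing.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy nonlinear "encoder" (hypothetical): x -> W2 tanh(W1 x).
n, h, m, batch = 32, 64, 10, 256
W1 = rng.standard_normal((h, n)) / np.sqrt(n)
W2 = rng.standard_normal((m, h)) / np.sqrt(h)
X = rng.standard_normal((batch, n))  # stand-in for a batch of inputs


def f(inputs):
    return np.tanh(inputs @ W1.T) @ W2.T


clean_out = f(X)


def loss_and_grad(v):
    """Mean squared output change caused by adding v, and its gradient in v."""
    hidden = np.tanh((X + v) @ W1.T)          # (batch, h)
    diff = hidden @ W2.T - clean_out          # (batch, m)
    loss = float(np.mean(np.sum(diff ** 2, axis=1)))
    g_hidden = 2.0 * diff @ W2                # gradient w.r.t. hidden units
    g_pre = g_hidden * (1.0 - hidden ** 2)    # back through tanh
    grad = (g_pre @ W1).mean(axis=0)          # back through W1, batch average
    return loss, grad


# Start from random noise and keep its norm fixed (assumption: this excludes v = 0).
v = rng.standard_normal(n)
radius = float(np.linalg.norm(v))
loss, grad = loss_and_grad(v)
initial_loss = loss

lr = 0.5
for _ in range(300):
    candidate = v - lr * grad
    candidate *= radius / np.linalg.norm(candidate)  # project back onto the sphere
    cand_loss, cand_grad = loss_and_grad(candidate)
    if cand_loss < loss:
        v, loss, grad = candidate, cand_loss, cand_grad  # accept improving steps only
    else:
        lr *= 0.5                                        # otherwise shrink the step

final_loss = loss
print("noise norm:", float(np.linalg.norm(v)))
print("output change before:", initial_loss, "after:", final_loss)
```

In a finetuning setting, noise found this way would be added to training inputs as an augmentation; how much the output change shrinks in this toy depends on the random weights and says nothing about the results reported for ViTs.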
