Approximate Nullspace Augmented Finetuning for Robust Vision Transformers

Haoyang Liu, Aditya Singh, Yijiang Li, Haohan Wang

TL;DR

The paper investigates intrinsic robustness of Vision Transformers through the lens of nullspaces, starting from a nontrivial nullspace in the patch embedding layer and extending to a generalized, approximate nullspace for nonlinear encoder blocks. It introduces a practical nullspace noise augmentation strategy: synthesize $\epsilon$-approximate nullspace perturbations $\tilde{\mathbf{v}}$ and fine-tune the model to be tolerant to such noise, effectively enlarging the approximate nullspace. Empirically, this nullspace-noise augmented finetuning yields significant robustness improvements against adversarial attacks and distribution shifts on ImageNet-1k with ViT variants, often outperforming or matching adversarial training baselines while maintaining performance on clean data. The work highlights a new, intrinsic robustness mechanism in ViTs, demonstrates a scalable augmentation technique, and suggests future directions for exploiting nullspace-inspired invariances in deep networks.

Abstract

Enhancing the robustness of deep learning models, particularly in the realm of vision transformers (ViTs), is crucial for their real-world deployment. In this work, we provide a finetuning approach to enhance the robustness of vision transformers inspired by the concept of nullspace from linear algebra. Our investigation centers on whether a vision transformer can exhibit resilience to input variations akin to the nullspace property in linear mappings, which would imply that perturbations sampled from this nullspace do not influence the model's output when added to the input. We start from the observation that many existing ViTs satisfy this property because their patch embedding layer has a non-trivial nullspace. Then, we extend the notion of nullspace to nonlinear settings and demonstrate that it is possible to synthesize approximate nullspace elements for ViT's encoder blocks through optimization. Finally, we propose a finetuning strategy for ViTs wherein we augment the training data with synthesized approximate nullspace noise. We find that our finetuning approach significantly improves the models' robustness to both adversarial and natural image perturbations. Code is available at: https://github.com/Liu-Hy/ns-vit.
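The nullspace property of a linear patch embedding can be illustrated with a small numerical sketch. Everything below is an assumption made for illustration: the toy patch size, the embedding dimension, and the random matrix `W` standing in for learned patch embedding weights. None of it is taken from the paper's code. The sketch computes an orthonormal nullspace basis via SVD, checks the rank-nullity count, and confirms that even a large perturbation from the nullspace leaves the embedding unchanged up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(0)

patch_dim = 3 * 4 * 4   # flattened toy patch: 3 channels, 4x4 pixels (assumed sizes)
embed_dim = 32          # embedding dimension smaller than the patch dimension

# Hypothetical stand-in for learned patch embedding weights and bias.
W = rng.normal(size=(embed_dim, patch_dim))
b = rng.normal(size=embed_dim)

def patch_embed(x):
    """Linear patch embedding of one flattened patch."""
    return W @ x + b

# Rows of Vt beyond rank(W) form an orthonormal basis of the nullspace of W.
_, s, Vt = np.linalg.svd(W, full_matrices=True)
rank = int(np.sum(s > 1e-10 * s[0]))
null_basis = Vt[rank:]                     # shape: (patch_dim - rank, patch_dim)

# Sample a deliberately large perturbation from the nullspace.
coeffs = rng.normal(size=null_basis.shape[0])
noise = 10.0 * (coeffs @ null_basis)

x = rng.normal(size=patch_dim)
change = np.max(np.abs(patch_embed(x + noise) - patch_embed(x)))

print("nullspace dimension:", null_basis.shape[0])
print("noise norm:", np.linalg.norm(noise))
print("max change in embedding:", change)
```

By rank-nullity, a full-rank `W` of this shape has a nullspace of dimension `patch_dim - embed_dim`; when the embedding dimension is at least the patch dimension and `W` has full column rank, `null_basis` would be empty.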

Paper Structure (23 sections, 1 theorem, 11 equations, 9 figures, 5 tables, 1 algorithm)

Key Result

Proposition 1

Consider a self-attention layer with $h$ heads and $\{(\mathbf{Q}_i, \mathbf{K}_i, \mathbf{V}_i)\}_{i=1}^h$ as its query, key and value projection matrices. If the following conditions are met, then there exists at least one $\mathbf{W} \neq \mathbf{0}$ such that $\operatorname{head}_i(\mathbf{X+W}) = \operatorname{head}_i(\mathbf{X})$ for every attention head $i$ in this layer and ...

Figures (9)

  • Figure 1: An illustration of the nullspace in three cases (projection function, top left; linear function, bottom left; vision transformer, right). In each case there exists a nullspace such that adding a perturbation from it to the input leaves the function's output unchanged, regardless of the perturbation strength. The nullspace is function-specific (model-specific) and does not vary across samples.
  • Figure 1: Nullspace dimensions for pre-trained ViT models. Nullspace is trivial ($\mathbf{0}$) when embedding dimension exceeds input dimension.
  • Figure 2: An example of nullspace noise. We show (a) sample input image, (b) noise generated by the basis vectors of the nullspace and (c) noisy image as a result of adding the nullspace noise to the input. Model's predictions for the clean and noisy inputs are identical.
  • Figure 3: Exploratory experiments on the generalized nullspace. (a) Solid lines (--) represent the model performance under the learned noise, and dashed lines ($\cdot\cdot\cdot$) represent the performance after random permutation of the elements of the learned noise vector. (b) By changing the regularization strength, we explore noise in the generalized nullspace at different magnitudes.
  • Figure 4: Trends of multiple metrics over training steps. "Adversarial" is the average performance of the four adversarial robustness settings, "OOD" is the average score on the six OOD datasets, and "avg" is the total average. All values are divided by their initial values to show the trend more clearly.
  • ...and 4 more figures
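
The idea of learning noise that a nonlinear encoder approximately ignores, and comparing it with unstructured noise of the same magnitude, can be sketched on a toy model. This is a minimal illustration under stated assumptions, not the paper's procedure: the encoder `B tanh(A x)`, its sizes, the fixed-norm projection used to control the noise magnitude, and the step size are all hypothetical choices made here. The sketch runs projected gradient descent on the mean squared output change and keeps the best noise vector found.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 8, 16, 4   # assumed toy sizes

# Hypothetical stand-in for a nonlinear encoder: f(x) = B tanh(A x).
A = rng.normal(size=(n_hidden, n_in)) / np.sqrt(n_in)
B = rng.normal(size=(n_out, n_hidden)) / np.sqrt(n_hidden)

def encoder(X):
    """Apply the toy encoder to a batch X of shape (batch, n_in)."""
    return np.tanh(X @ A.T) @ B.T

def output_change(v, X):
    """Mean squared change in encoder output when v is added to every input."""
    diff = encoder(X + v) - encoder(X)
    return float(np.mean(np.sum(diff ** 2, axis=1)))

def grad_output_change(v, X):
    """Analytic gradient of output_change with respect to v."""
    H = np.tanh((X + v) @ A.T)        # hidden activations, (batch, n_hidden)
    R = H @ B.T - encoder(X)          # output residual, (batch, n_out)
    dZ = (2.0 * R @ B) * (1.0 - H ** 2)   # tanh'(z) = 1 - tanh(z)^2
    return (dZ @ A).mean(axis=0)

def synthesize_noise(X, radius, steps=500, lr=0.1):
    """Projected gradient descent over noise vectors of fixed norm `radius`."""
    v = rng.normal(size=n_in)
    v *= radius / np.linalg.norm(v)
    v_init = v.copy()
    best_v, best_loss = v.copy(), output_change(v, X)
    for _ in range(steps):
        v = v - lr * grad_output_change(v, X)
        v *= radius / np.linalg.norm(v)   # project back onto the sphere
        loss = output_change(v, X)
        if loss < best_loss:
            best_v, best_loss = v.copy(), loss
    return v_init, best_v

X = rng.normal(size=(64, n_in))
v_random, v_learned = synthesize_noise(X, radius=1.0)
print("output change, random noise: ", output_change(v_random, X))
print("output change, learned noise:", output_change(v_learned, X))
```

Fixing the norm is only one way to keep the noise from collapsing to zero; a penalty term with a tunable strength would play a similar role and would let the noise magnitude vary.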

Theorems & Definitions (2)

  • Proposition 1
  • Remark 1