Robustness Tokens: Towards Adversarial Robustness of Transformers

Brian Pulfer, Yury Belousov, Slava Voloshynovskiy

TL;DR

This work tackles the vulnerability of publicly available Vision Foundation Models to adversarial inputs by introducing Robustness Tokens, a small set of private input tokens appended to transformer sequences. The tokens are trained with two objectives, $\mathcal{L}_{\text{inv}}(\mathbf{r})$ and $\mathcal{L}_{\text{adv}}(\mathbf{r})$, so that representations remain stable on clean data and align under adversarial perturbations via $\mathcal{L}(\mathbf{r}) = \mathcal{L}_{\text{inv}}(\mathbf{r}) + \mathcal{L}_{\text{adv}}(\mathbf{r})$, while updating only $\mathbf{r}$. Empirical results on DiNOv2, OpenCLIP, and DEIT-III show that Robustness Tokens preserve downstream performance while substantially improving robustness against white-box attacks, generalize to attacks like AutoAttack, and converge rapidly with low training cost. The approach also reveals that larger models exhibit larger robustness gains and that massive activations observed in transformers can be mitigated by the tokens, suggesting practical, efficient defenses for transformer backbones in real-world deployments.

Abstract

Recently, large pre-trained foundation models have become widely adopted by machine learning practitioners for a multitude of tasks. Given that such models are publicly available, relying on their use as backbone models for downstream tasks might result in high vulnerability to adversarial attacks crafted with the same public model. In this work, we propose Robustness Tokens, a novel approach specific to the transformer architecture that fine-tunes a few additional private tokens with low computational requirements instead of tuning model parameters as done in traditional adversarial training. We show that Robustness Tokens make Vision Transformer models significantly more robust to white-box adversarial attacks while also retaining the original downstream performances.
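
The objective summarized in the TL;DR, $\mathcal{L}(\mathbf{r}) = \mathcal{L}_{\text{inv}}(\mathbf{r}) + \mathcal{L}_{\text{adv}}(\mathbf{r})$ with only $\mathbf{r}$ updated, can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the toy backbone (`TinyBlock`), the cosine-based `distance`, the small PGD routine that attacks the public model without tokens, the perturbation of patch embeddings rather than pixels, and all hyperparameters are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
DIM, N_PATCHES, N_TOKENS = 32, 16, 4  # toy sizes (assumed, not from the paper)


class TinyBlock(nn.Module):
    """Single-head pre-norm transformer block; a hypothetical stand-in backbone."""

    def __init__(self, dim):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.qkv, self.proj = nn.Linear(dim, 3 * dim), nn.Linear(dim, dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 2 * dim), nn.GELU(), nn.Linear(2 * dim, dim))

    def forward(self, x):
        q, k, v = self.qkv(self.norm1(x)).chunk(3, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)
        x = x + self.proj(attn @ v)
        return x + self.mlp(self.norm2(x))


backbone = nn.Sequential(TinyBlock(DIM), TinyBlock(DIM)).eval()
for p in backbone.parameters():
    p.requires_grad_(False)  # the public model stays frozen

# Private Robustness Tokens r: the only trainable tensor.
r = nn.Parameter(0.02 * torch.randn(1, N_TOKENS, DIM))
r_init = r.detach().clone()
optimizer = torch.optim.Adam([r], lr=1e-2)


def features(x, tokens=None):
    """Run the backbone, optionally appending tokens; return patch outputs only."""
    if tokens is not None:
        x = torch.cat([x, tokens.expand(x.size(0), -1, -1)], dim=1)
    return backbone(x)[:, :N_PATCHES]


def distance(a, b):
    """Assumed feature distance: 1 - mean cosine similarity."""
    return 1.0 - F.cosine_similarity(a, b, dim=-1).mean()


def pgd(x, target, eps=0.1, alpha=0.03, steps=3):
    """Small L-infinity PGD on the public model (no tokens), pushing features away from target."""
    delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        loss = distance(features(x + delta), target)
        (grad,) = torch.autograd.grad(loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    return (x + delta).detach()


x = torch.randn(8, N_PATCHES, DIM)  # random stand-in for patch embeddings
with torch.no_grad():
    target = features(x)  # clean features of the original model, without tokens

losses = []
for step in range(20):
    x_adv = pgd(x, target)
    loss_inv = distance(features(x, r), target)      # stay close on clean inputs
    loss_adv = distance(features(x_adv, r), target)  # realign on adversarial inputs
    loss = loss_inv + loss_adv
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

Because every backbone parameter has `requires_grad` set to `False`, the backward pass only produces a gradient for `r`, which is what keeps the training cost low compared with adversarially fine-tuning the whole model. The sign convention here (minimizing a distance) is a choice of this sketch; the paper's exact loss definitions may differ.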

Paper Structure

This paper contains 19 sections, 4 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Schematic representation of Robustness Tokens
  • Figure 2: PyTorch-like pseudocode for training Robustness Tokens.
  • Figure 3: Segmentation predictions of vanilla and enhanced DiNOv2 ViT-B/14 on clean and adversarial samples from the ADE20K dataset. From left to right: Ground truth segmentation mask, clean sample, prediction on the clean sample, prediction with Robustness Tokens on the clean sample, adversarial sample, prediction on the adversarial sample, prediction with Robustness Tokens on the adversarial sample.
  • Figure 4: Training curves of Robustness Tokens for DiNOv2. We train small, base, large, and giant models with 1, 10, 20, and 50 Robustness Tokens. Training converges within a few steps in all cases.
  • Figure 5: $\mathcal{L}_{\text{inv}}$ and $\mathcal{L}_{\text{adv}}$ terms through training for base, large, and huge DEIT-III and OpenCLIP models. While the $\mathcal{L}_{\text{inv}}$ term is relatively stable through training, the $\mathcal{L}_{\text{adv}}$ term is quickly maximized.
  • ...and 1 more figures