Vision Transformers Don't Need Trained Registers

Nick Jiang, Amil Dravid, Alexei Efros, Yossi Gandelsman

TL;DR

The paper studies the high-norm outlier tokens in Vision Transformers that disrupt attention and traces them to a sparse set of "register neurons". Editing the activations of these neurons relocates the high-norm tokens to a new test-time register token, mimicking trained registers without retraining. Across classification, dense prediction, unsupervised object discovery, and vision-language tasks, test-time registers yield cleaner attention and competitive or improved performance relative to models trained with registers, and they improve the interpretability of cross-modal attributions. The paper also gives a simple mathematical model of register-neuron dynamics. Together, the analysis and experiments support the claim that test-time registers can replace trained register tokens in many settings, offering a training-free way to mitigate these artifacts in pre-trained ViTs and VLMs.

Abstract

We investigate the mechanism underlying a previously identified phenomenon in Vision Transformers: the emergence of high-norm tokens that lead to noisy attention maps (Darcet et al., 2024). We observe that in multiple models (e.g., CLIP, DINOv2), a sparse set of neurons is responsible for concentrating high-norm activations on outlier tokens, leading to irregular attention patterns and degrading downstream visual processing. While the existing solution for removing these outliers involves retraining models from scratch with additional learned register tokens, we use our findings to create a training-free approach to mitigate these artifacts. By shifting the high-norm activations from our discovered register neurons into an additional untrained token, we can mimic the effect of register tokens on a model already trained without registers. We demonstrate that our method produces cleaner attention and feature maps, enhances performance over base models across multiple downstream visual tasks, and achieves results comparable to models explicitly trained with register tokens. We then extend test-time registers to off-the-shelf vision-language models, yielding cleaner attention-based text-to-image attribution. Finally, we outline a simple mathematical model that reflects the observed behavior of register neurons and high-norm tokens. Our results suggest that test-time registers effectively take on the role of register tokens at test time, offering a training-free solution for any pre-trained model released without them.
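
To make the notion of a high-norm outlier token concrete, here is a minimal NumPy sketch that flags tokens whose L2 norm is far above the typical norm in a layer. The function name `find_outlier_tokens`, the median-based rule, and the threshold `ratio=5.0` are illustrative assumptions, not the detection criterion used in the paper, and the token matrix is synthetic rather than taken from a real ViT.

```python
import numpy as np

def find_outlier_tokens(tokens: np.ndarray, ratio: float = 5.0) -> np.ndarray:
    """Return indices of tokens whose L2 norm exceeds `ratio` x the median norm.

    tokens: (num_tokens, dim) array, e.g. patch embeddings at one layer.
    ratio:  hypothetical threshold chosen for illustration only.
    """
    norms = np.linalg.norm(tokens, axis=-1)
    return np.flatnonzero(norms > ratio * np.median(norms))

# Synthetic stand-in for a 14x14 grid of patch tokens with 64 channels.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(196, 64))
tokens[[17, 120]] *= 20.0  # plant two artificial high-norm tokens

outliers = find_outlier_tokens(tokens)
print(outliers)
```

On real activations one would apply the same check to the residual stream after each block to see at which layer the outliers first appear.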

Paper Structure

This paper contains 35 sections, 4 theorems, 19 equations, 34 figures, 13 tables, 3 algorithms.

Key Result

Proposition 1

Let $u_1 = (W_{\text{in}}^{(1)})_{:, 1}$ be a register neuron and $u_2 = (W_{\text{out}}^{(1)})_{1, :}$ be the corresponding row in the MLP's second weight matrix, with $\|u_2\| \gg\|(W_{\text{out}}^{(1)})_{j,:}\|$ for $j \neq 1$. If both $u_1, u_2 \in \ker(W_{\text{in}}^{(2)\top}) \cap \ker(W^\top_

Figures (34)

  • Figure 1: Controlling high-norm tokens in Vision Transformers. As shown in Darcet et al. (2024), high-norm outlier tokens emerge in ViTs and lead to noisy attention maps ("Original"). By identifying the mechanism responsible for their emergence, we demonstrate that we can shift these outlier tokens to arbitrary positions at test time ("Shifted"). Shifting the outlier tokens outside of the image mimics register behavior at test time ("w/ Test-time Register"), resulting in more interpretable attention patterns and downstream performance comparable to models retrained with registers.
  • Figure 2: Outlier patches appear after MLPs; attention sinks appear after outlier patches. Left: Max norms across image patches (OpenCLIP ViT-B/16). Right: max attention scores of the [CLS] token in the last layer. In both plots, we average across 1000 images. The outlier norms and attention sinks occur in consecutive layers.
  • Figure 3: Highly activated neurons on the top outlier activate on all outlier positions. We present activation maps of three neurons from layer 6 that activate highly on the top outlier patch. These maps near-perfectly align with the high-norm outliers ("Patch Norms").
  • Figure 4: Intervening on activations of register neurons effectively shifts outliers to random patches and test-time registers. For all register neurons, we copy their highest activation into a selected patch and zero out the activations elsewhere. Left: norm of chosen random patch (yellow) and max norm of any other patch (blue). Right: [CLS] attention to chosen random patch (yellow) and max [CLS] attention (blue) to any other patch. Our intervention can shift outliers to randomly selected patches as well as test-time registers (see the appendix).
  • Figure 5: Qualitative results on attention maps w/ test-time registers. We present the last layer's mean [CLS] attention maps in DINOv2 and compare them to the model with trained registers. Test-time registers produce similarly high-quality maps as trained registers.
  • ...and 29 more figures
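
The intervention described in the Figure 4 caption (copy each register neuron's highest activation into a selected token and zero it elsewhere) can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: `shift_register_activations` is a hypothetical helper, `acts` stands in for one image's MLP hidden activations at a single layer, the register-neuron indices are made up, and the extra token is appended here as a zero row purely for the example.

```python
import numpy as np

def shift_register_activations(acts, register_neurons, target_token):
    """Move register-neuron activations onto a single target token.

    acts:             (num_tokens, hidden_dim) MLP hidden activations.
    register_neurons: list of neuron (column) indices to edit.
    target_token:     row index that should receive the activations.
    """
    edited = acts.copy()
    # Highest activation of each register neuron across all tokens.
    peak = acts[:, register_neurons].max(axis=0)
    # Zero the register neurons everywhere, then write the peak to the target.
    edited[:, register_neurons] = 0.0
    edited[target_token, register_neurons] = peak
    return edited

rng = np.random.default_rng(0)
acts = rng.normal(size=(5, 8))   # 5 image tokens, 8 hidden neurons (toy sizes)
acts[2, [1, 6]] = [50.0, 40.0]   # token 2 plays the role of the outlier patch

# Append one extra token (assumed zero-initialized here) to act as the register.
acts_with_reg = np.vstack([acts, np.zeros((1, 8))])
edited = shift_register_activations(acts_with_reg, [1, 6], target_token=5)
print(edited[:, [1, 6]])
```

Passing the index of an ordinary patch as `target_token` corresponds to the "random patch" setting in the caption; pointing it at the appended row corresponds to the test-time register setting.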

Theorems & Definitions (4)

  • Proposition 1: Register neuron induces attention sink and no-op attention
  • Corollary 1: Register neuron induces implicit attention bias
  • Proposition 1: Register neuron induces attention sink and no-op attention
  • Corollary 1: Register neuron induces implicit attention bias