Vision Transformers Don't Need Trained Registers
Nick Jiang, Amil Dravid, Alexei Efros, Yossi Gandelsman
TL;DR
The paper addresses high-norm outlier tokens in Vision Transformers, which disrupt attention, by identifying a sparse set of 'register neurons' responsible for them. It shows that editing the activations of these neurons relocates the high-norm activations to a new test-time register token, mimicking trained registers without retraining. Across classification, dense prediction, unsupervised object discovery, and vision-language tasks, test-time registers yield cleaner attention and performance comparable to or better than models trained with registers, and they improve the interpretability of cross-modal attributions. The work also provides a simple mathematical model of register-neuron dynamics. Together, these results support the claim that test-time registers can replace training-time register tokens in many settings, offering a scalable, training-free way to mitigate artifacts in pre-trained ViTs and VLMs.
Abstract
We investigate the mechanism underlying a previously identified phenomenon in Vision Transformers: the emergence of high-norm tokens that lead to noisy attention maps (Darcet et al., 2024). We observe that in multiple models (e.g., CLIP, DINOv2), a sparse set of neurons is responsible for concentrating high-norm activations on outlier tokens, leading to irregular attention patterns and degrading downstream visual processing. While the existing solution for removing these outliers involves retraining models from scratch with additional learned register tokens, we use our findings to create a training-free approach to mitigate these artifacts. By shifting the high-norm activations from our discovered register neurons into an additional untrained token, we can mimic the effect of register tokens on a model already trained without registers. We demonstrate that our method produces cleaner attention and feature maps, enhances performance over base models across multiple downstream visual tasks, and achieves results comparable to models explicitly trained with register tokens. We then extend test-time registers to off-the-shelf vision-language models, yielding cleaner attention-based text-to-image attribution. Finally, we outline a simple mathematical model that reflects the observed behavior of register neurons and high-norm tokens. Our results suggest that test-time registers effectively take on the role of register tokens at test time, offering a training-free solution for any pre-trained model released without them.
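As a rough illustration of what "shifting the high-norm activations from register neurons into an additional untrained token" could look like, the following toy sketch operates on a plain token-by-neuron activation table. The abstract does not state the exact editing rule, so everything here is an assumption: the function name `shift_to_register`, the choice to copy each register neuron's maximum activation to the extra token, the zeroing of that neuron on all original tokens, and the all-zero initialization of the extra token.

```python
from typing import List, Sequence


def shift_to_register(
    activations: List[List[float]],
    register_neurons: Sequence[int],
) -> List[List[float]]:
    """Return a copy of `activations` with one extra token row appended.

    `activations[t][n]` is the activation of neuron `n` at token `t`.
    For each index in `register_neurons`, the largest activation across the
    original tokens is moved to the new (last) row and zeroed elsewhere.
    This editing rule is an illustrative assumption, not the paper's code.
    """
    num_neurons = len(activations[0])
    # Copy so the caller's data is not modified.
    edited = [list(row) for row in activations]
    # Hypothetical simplification: the extra token starts at zero everywhere.
    register_row = [0.0] * num_neurons
    for n in register_neurons:
        # Move the peak activation of this neuron onto the extra token...
        register_row[n] = max(row[n] for row in activations)
        # ...and suppress it on every original token.
        for row in edited:
            row[n] = 0.0
    edited.append(register_row)
    return edited


# Toy example: 3 patch tokens, 4 neurons; neuron 2 fires strongly on token 1.
acts = [
    [0.1, 0.2, 0.3, 0.0],
    [0.0, 0.1, 9.0, 0.2],
    [0.3, 0.0, 0.4, 0.1],
]
edited = shift_to_register(acts, register_neurons=[2])
```

In a real ViT the extra token would be appended to the input sequence and pass through every layer, and the edit would be applied to the relevant MLP activations during the forward pass; the sketch isolates only the relocation step.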
