Leveraging Registers in Vision Transformers for Robust Adaptation

Srikar Yellapragada, Kowshik Thopalli, Vivek Narayanaswamy, Wesam Sakla, Yang Liu, Yamen Mubarka, Dimitris Samaras, Jayaraman J. Thiagarajan

TL;DR

Vision Transformers can be sensitive to distribution shifts due to high-norm patch tokens. This work examines register tokens as auxiliary information and fuses them with the CLS representation by forming $f_i = [c_i; \mu_R^i]$, where $c_i$ is the CLS embedding of image $i$ and $\mu_R^i = \frac{1}{M} \sum_{k=1}^{M} r_k^i$ is the mean of its $M$ register embeddings, training only a linear probe on frozen backbones. Across DINO-v2 ViT backbones trained with and without registers, the proposed CLS+$\mu_R$ approach delivers 2–4 percentage point improvements in top-1 OOD accuracy and 2–3 percentage point reductions in anomaly-detection false positive rates, with no additional computational overhead. These results support registers as a source of complementary, global information that enhances robustness and adaptation in Vision Transformers.
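The fusion step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `fuse_cls_and_registers` is hypothetical, and the token layout (CLS first, then the $M$ registers, then the patch tokens) is an assumption about how the backbone returns its output tokens.

```python
import numpy as np


def fuse_cls_and_registers(tokens: np.ndarray, num_registers: int) -> np.ndarray:
    """Build f_i = [c_i; mu_R^i] for a batch of ViT output tokens.

    Assumed layout (not stated in the source): tokens has shape
    (batch, 1 + num_registers + num_patches, dim), ordered as
    [CLS, register_1..register_M, patch tokens].
    """
    if num_registers < 1:
        raise ValueError("num_registers must be at least 1")
    cls = tokens[:, 0, :]                            # c_i, shape (batch, dim)
    registers = tokens[:, 1:1 + num_registers, :]    # r_1..r_M, shape (batch, M, dim)
    mu_r = registers.mean(axis=1)                    # mu_R^i, shape (batch, dim)
    return np.concatenate([cls, mu_r], axis=-1)      # f_i, shape (batch, 2 * dim)


# Toy example: 2 images, 1 CLS + 4 registers + 16 patches, embedding dim 8.
rng = np.random.default_rng(0)
toy_tokens = rng.normal(size=(2, 1 + 4 + 16, 8))
toy_features = fuse_cls_and_registers(toy_tokens, num_registers=4)
print(toy_features.shape)  # (2, 16)
```

The patch tokens are deliberately ignored here; only the CLS and register embeddings enter the fused feature, which doubles the input dimension of the downstream linear probe.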

Abstract

Vision Transformers (ViTs) have shown success across a variety of tasks due to their ability to capture global image representations. Recent studies have identified the existence of high-norm tokens in ViTs, which can interfere with unsupervised object discovery. To address this, "registers" have been proposed: additional tokens that isolate high-norm patch tokens while capturing global image-level information. While registers have been studied extensively for object discovery, their generalization properties, particularly in out-of-distribution (OOD) scenarios, remain underexplored. In this paper, we examine the utility of register token embeddings in providing additional features for improving generalization and anomaly rejection. To that end, we propose a simple method that combines the special CLS token embedding commonly employed in ViTs with the average-pooled register embeddings to create feature representations, which are subsequently used for training a downstream classifier. We find that this enhances OOD generalization and anomaly rejection while maintaining in-distribution (ID) performance. Extensive experiments across multiple ViT backbones trained with and without registers reveal consistent improvements of 2–4% in top-1 OOD accuracy and a 2–3% reduction in false positive rates for anomaly detection. Importantly, these gains are achieved without additional computational overhead.
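To make the anomaly-rejection metric concrete, the sketch below computes a false positive rate at a fixed true positive rate. The abstract does not state the operating point or the scoring function, so several choices here are assumptions: ID samples are treated as the positive class, a higher score means "more in-distribution", and the threshold is set so that about 95% of ID samples are retained. The function name `fpr_at_tpr` is hypothetical.

```python
import numpy as np


def fpr_at_tpr(id_scores, ood_scores, tpr: float = 0.95) -> float:
    """Fraction of anomalous (OOD) samples accepted as ID at a fixed ID recall.

    Assumptions (not from the source): higher score = more in-distribution,
    and the threshold keeps roughly `tpr` of the ID scores at or above it.
    """
    id_scores = np.asarray(id_scores, dtype=float)
    ood_scores = np.asarray(ood_scores, dtype=float)
    # The (1 - tpr) quantile of ID scores leaves about `tpr` of them above it.
    threshold = np.quantile(id_scores, 1.0 - tpr)
    # An OOD sample scoring at or above the threshold is a false positive.
    return float(np.mean(ood_scores >= threshold))


# Toy scores: ID scores spread over [0.5, 1.0], one of four OOD scores is high.
toy_id = np.linspace(0.5, 1.0, 101)
toy_ood = np.array([0.1, 0.2, 0.6, 0.3])
print(fpr_at_tpr(toy_id, toy_ood))
```

Under this convention, a reduction in false positive rate means fewer anomalous inputs slip past the threshold while the same share of ID inputs is still accepted.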

Paper Structure

This paper contains 7 sections, 1 equation, 2 figures, and 1 table.

Figures (2)

  • Figure 1: Impact of token embedding choices on linear probing with frozen DINO-v2 ViT-G backbones. Each color indicates the token embeddings chosen for optimizing a linear classifier on ImageNet (IN)-1K, along with the protocol adopted for pre-training the backbone (with or without registers). Here, [CLS] represents the classification token, and $\mu_P$ and $\mu_R$ represent the mean patch and mean register token embeddings, respectively. While it is common to utilize [CLS] and $\mu_P$ to train the classifier, we obtain improved generalization if the backbone is trained with registers (red vs. green). Training a classifier naively with register tokens results in a drop in generalization. However, our approach (yellow) maintains ID accuracy while providing substantial gains in OOD (ImageNet-Adversarial, ImageNet-Sketch) accuracy.
  • Figure 2: Overview of our proposed method. For large-scale vision transformer backbones (e.g., DINO-v2) pre-trained with "registers", we find that concatenating the mean of the registers, Mean(R$_1$, R$_2$, R$_3$, R$_4$), with [CLS] is critical for obtaining rich features that enable robust adaptation. In particular, we train a linear classifier on these concatenated features and observe improved generalization and anomaly rejection capabilities.
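
The linear-probing step mentioned in the Figure 2 caption can be illustrated as follows. This is a minimal sketch under stated assumptions, not the authors' training recipe: it fits a softmax linear classifier with plain full-batch gradient descent on features that are already extracted and held fixed, and the optimizer, learning rate, step count, and the names `train_linear_probe` and `predict` are all hypothetical. In the paper's setting the inputs would be the concatenated [CLS; $\mu_R$] features; synthetic features are used here so the example runs on its own.

```python
import numpy as np


def train_linear_probe(features, labels, num_classes, lr=0.1, steps=200):
    """Fit a softmax linear classifier on fixed (frozen-backbone) features.

    Hyperparameters are illustrative assumptions, not values from the paper.
    """
    n, d = features.shape
    weights = np.zeros((d, num_classes))
    bias = np.zeros(num_classes)
    one_hot = np.eye(num_classes)[labels]             # (n, num_classes)
    for _ in range(steps):
        logits = features @ weights + bias
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad_logits = (probs - one_hot) / n           # gradient of mean cross-entropy
        weights -= lr * (features.T @ grad_logits)
        bias -= lr * grad_logits.sum(axis=0)
    return weights, bias


def predict(features, weights, bias):
    """Return the arg-max class index for each feature vector."""
    return (features @ weights + bias).argmax(axis=1)


# Synthetic stand-in for fused features: two well-separated classes in 4-D.
rng = np.random.default_rng(0)
toy_x = np.concatenate([rng.normal(2.0, 1.0, (100, 4)),
                        rng.normal(-2.0, 1.0, (100, 4))])
toy_y = np.array([0] * 100 + [1] * 100)
toy_w, toy_b = train_linear_probe(toy_x, toy_y, num_classes=2)
print((predict(toy_x, toy_w, toy_b) == toy_y).mean())
```

Only the classifier weights and bias are updated; the backbone never appears in the loop, which mirrors the frozen-backbone setup the summary describes.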