Vision Transformers Need More Than Registers

Cheng Shi; Yizhou Yu; Sibei Yang

Abstract

Vision Transformers (ViTs), when pre-trained on large-scale data, provide general-purpose representations for diverse downstream tasks. However, artifacts in ViTs are widely observed across different supervision paradigms and downstream tasks, and their fundamental mechanisms have yet to be sufficiently elucidated. In this paper, through systematic analysis, we conclude that these artifacts originate from a lazy aggregation behavior: ViT uses semantically irrelevant background patches as shortcuts to represent global semantics, driven by global attention and coarse-grained semantic supervision. Our solution selectively integrates patch features into the CLS token, reducing the influence of background-dominated shortcuts and consistently improving performance across 12 benchmarks under label-, text-, and self-supervision. We hope this work offers a new perspective on ViT behavior.

Paper Structure (30 sections, 7 equations, 13 figures, 12 tables)

Introduction
Related Work
Preliminary
Network Architecture: Vision Transformer
Analysis and Hypothesis
New Metric: Patch Score and Point-in-Box
Artifacts in Patch Score
Coarse-grained Semantic Supervision
Lazy Behavior from ViT’s Global Dependencies
Method
LaSt-ViT
Where does the CLS token in LaSt-ViT look at?
Transfer to Downstream Tasks
Experiment
Experiment Settings
...and 15 more sections

Figures (13)

Figure 1: Patch-score distribution and masking probe on ImageNet-1k. (a) Normalized distributions of patch scores for foreground vs. background. (b) Removing top-$k$ high-score patches (up to $70\%$) does not hurt accuracy and can even improve it.
Figure 2: Training dynamics on ImageNet-1k. Left: top-1 accuracy; Right: Point-in-Box score. As training proceeds, ViT’s classification accuracy steadily improves, yet its Point-in-Box score remains nearly flat (around $0.42 \rightarrow 0.44$) and consistently lower than ResNet’s across the entire training process.
Figure 3: Effect of coarse-grained semantic supervision. Increasing the patch size reduces background tokens by $10\%$. Effect: Point-in-Box rises from $0.44$ to $0.52$, and high-score patches shift toward foreground. Trade-off: classification accuracy decreases, indicating that coarse-grained semantic supervision contributes to artifacts, while naive patch coarsening compromises recognition.
Figure 4: Where does the CLS token in LaSt-ViT "look at"? For each image, patches whose vote count exceeds 50%, 30%, or 20% of the largest vote count within the image are visualized in red from left to right, respectively. After the application of LaSt-ViT, highly voted patches consistently correspond to foreground regions, showing that the CLS token primarily aggregates foreground tokens rather than background ones.
Figure 5: Evaluation of LaSt-ViT in terms of feature norm. Eliminating artifacts also removes the high-norm phenomenon associated with registers, supporting our deeper perspective on addressing artifacts.
...and 8 more figures
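
The thresholding rule in the Figure 4 caption (highlight patches whose vote count exceeds 50%, 30%, or 20% of the largest vote count in the image) can be illustrated with a minimal sketch. How the vote counts are produced is not described in this summary, so the `votes` list below is hypothetical data, and the function name `highly_voted_patches` is an illustrative choice rather than anything from the paper.

```python
def highly_voted_patches(votes, fraction):
    """Return indices of patches whose vote count exceeds
    `fraction` of the largest vote count in the image."""
    if not votes:
        return []
    peak = max(votes)
    return [i for i, v in enumerate(votes) if v > fraction * peak]


# Hypothetical vote counts for a 3x3 patch grid, flattened row by row.
# These numbers are made up for illustration only.
votes = [0, 1, 1, 5, 20, 12, 0, 7, 1]

# The three relative thresholds named in the Figure 4 caption.
for fraction in (0.5, 0.3, 0.2):
    print(fraction, highly_voted_patches(votes, fraction))
```

Lowering the fraction can only add patches to the highlighted set, which is why the three panels in the figure grow from left to right.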
