Vision Transformers with Self-Distilled Registers

Yinjie Chen, Zipeng Yan, Chong Zhou, Bo Dai, Andrew F. Luo

TL;DR

Vision Transformers often develop artifact tokens that hinder fine-grained localization. PH-Reg introduces a post hoc self-distillation framework that adds a lightweight set of register tokens to an existing pre-trained ViT, using a frozen teacher and test-time augmentation to generate denoised targets for a minimally updated student. The approach achieves cleaner dense representations and consistent improvements on open-vocabulary segmentation, linear-probe segmentation, and depth prediction, with substantial gains across multiple backbones and datasets. This method enables high-precision ViT localization with minimal computational overhead and no labeled data or full retraining, increasing practical applicability to pre-trained vision models.

Abstract

Vision Transformers (ViTs) have emerged as the dominant architecture for visual processing tasks, demonstrating excellent scalability with increased training data and model size. However, recent work has identified the emergence of artifact tokens in ViTs that are incongruous with local semantics. These anomalous tokens degrade ViT performance in tasks that require fine-grained localization or structural coherence. An effective mitigation of this issue is the addition of register tokens to ViTs, which implicitly "absorb" the artifact term during training. Given the availability of existing large-scale pre-trained ViTs, in this paper we seek to add register tokens to existing models without re-training them from scratch, which is infeasible considering their size. Specifically, we propose Post Hoc Registers (PH-Reg), an efficient self-distillation method that integrates registers into an existing ViT without requiring additional labeled data or full retraining. PH-Reg initializes both teacher and student networks from the same pre-trained ViT. The teacher remains frozen and unmodified, while the student is augmented with randomly initialized register tokens. By applying test-time augmentation to the teacher's inputs, we generate denoised dense embeddings free of artifacts, which are then used to optimize only a small subset of unlocked student weights. We show that our approach can effectively reduce the number of artifact tokens, improving the segmentation and depth prediction of the student ViT under zero-shot and linear probing.
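The teacher-side denoising step can be illustrated with a small NumPy sketch. This is not the authors' implementation: `toy_encode` is a hypothetical stand-in for a frozen ViT whose artifact is locked to one token position, the offsets are assumed to be circular shifts by whole patches, and the names `denoise_with_tta`, `n_aug`, and `patch` are chosen here for illustration. The sketch only shows the general idea of encoding augmented views (random offsets and horizontal flips), mapping the features back to the original frame, and averaging.

```python
import numpy as np

def toy_encode(image, patch=4):
    """Hypothetical stand-in for a frozen ViT: per-patch mean colour plus a
    position-locked artifact at token (0, 0)."""
    h, w, c = image.shape
    feats = image.reshape(h // patch, patch, w // patch, patch, c).mean(axis=(1, 3))
    feats[0, 0] += 10.0  # artifact tied to a token position, not to image content
    return feats

def denoise_with_tta(image, encode, n_aug=16, patch=4, seed=0):
    """Average dense features over random shifts and horizontal flips.

    Assumption: shifts are circular and a multiple of the patch size, so each
    augmented feature map can be realigned exactly with the original grid.
    """
    rng = np.random.default_rng(seed)
    gh, gw = image.shape[0] // patch, image.shape[1] // patch
    acc = np.zeros((gh, gw, image.shape[2]))
    for _ in range(n_aug):
        dy, dx = int(rng.integers(0, gh)), int(rng.integers(0, gw))
        flip = rng.random() < 0.5
        # Forward augmentation: shift by whole patches, then optionally flip.
        aug = np.roll(image, (dy * patch, dx * patch), axis=(0, 1))
        if flip:
            aug = aug[:, ::-1]
        feats = encode(aug)
        # Inverse augmentation on the token grid, in reverse order.
        if flip:
            feats = feats[:, ::-1]
        feats = np.roll(feats, (-dy, -dx), axis=(0, 1))
        acc += feats
    return acc / n_aug

rng = np.random.default_rng(1)
img = rng.random((32, 32, 3))
clean = img.reshape(8, 4, 8, 4, 3).mean(axis=(1, 3))
raw_err = np.abs(toy_encode(img) - clean).max()
den_err = np.abs(denoise_with_tta(img, toy_encode) - clean).max()
print(raw_err, den_err)
```

In this toy setting the content-dependent part of the features realigns exactly, while the position-locked artifact lands on a different token in most views, so averaging lowers its peak magnitude rather than removing it outright. In PH-Reg, such averaged teacher features serve as the targets for the register-augmented student.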

Paper Structure

This paper contains 30 sections, 6 equations, 15 figures, 11 tables, 1 algorithm.

Figures (15)

  • Figure 1: Effect of PH-Reg on Open-vocabulary Segmentation. For each image, we compare four methods: MaskCLIP, which directly takes the value features from the last attention layer; SCLIP, which adds correlative self-attention; NACLIP, which further enforces a locality bias; and our PH-Reg method with self-distilled registers. We utilize the same OpenAI CLIP ViT-B/16 weights for all methods. For each method, we visualize the UMAP of the dense features and a heatmap of one text query. Our method yields noticeably cleaner dense features and high-quality localizations, and requires only a small set of additional register parameters compared to the original network.
  • Figure 2: Learning Framework of PH-Reg. (a) Our framework begins by creating two networks from the same set of weights. In the teacher, the weights are frozen and unmodified. In the student, the only additional parameters are learnable register tokens. The teacher creates a learning target using denoised representations. (b) An image $\mathcal{I}$ undergoes augmentation by function $\mathcal{T}$ with random augmentation parameters consisting of random offsets and horizontal flips. (c) Given an RGB image, we utilize UMAP to visualize the features, and show a heatmap for a CLIP text query. Our method can produce significantly cleaner dense representations with minimal additional inference cost.
  • Figure 3: Denoising Teacher Representations with Augmentations. For each model, we visualize the UMAP of dense features before and after applying test-time augmentation. The results show that our proposed method produces noticeably cleaner dense feature representations without requiring gradient-based learning. Please zoom in for details.
  • Figure 4: Visualization of Open-vocabulary Semantic Segmentation. We compare against MaskCLIP, SCLIP, NACLIP, and find that our method yields clean feature maps free of artifacts.
  • Figure 5: Ablation on number of registers and augmentations
  • ...and 10 more figures