Vision Transformers with Self-Distilled Registers
Yinjie Chen, Zipeng Yan, Chong Zhou, Bo Dai, Andrew F. Luo
TL;DR
Vision Transformers often develop artifact tokens that hinder fine-grained localization. PH-Reg introduces a post hoc self-distillation framework that adds a lightweight set of register tokens to an existing pre-trained ViT, using a frozen teacher and test-time augmentation to generate denoised targets for a minimally updated student. The approach yields cleaner dense representations and consistent improvements on open-vocabulary segmentation, linear-probe segmentation, and depth prediction across multiple backbones and datasets. It requires no labeled data or full retraining and adds minimal computational overhead, which makes it practical to apply to existing pre-trained vision models.
Abstract
Vision Transformers (ViTs) have emerged as the dominant architecture for visual processing tasks, demonstrating excellent scalability with increased training data and model size. However, recent work has identified the emergence of artifact tokens in ViTs that are incongruous with local semantics. These anomalous tokens degrade ViT performance in tasks that require fine-grained localization or structural coherence. An effective mitigation of this issue is the addition of register tokens to ViTs, which implicitly "absorb" the artifact term during training. Given the availability of existing large-scale pre-trained ViTs, in this paper we seek to add register tokens to existing models without re-training them from scratch, which is infeasible given their size. Specifically, we propose Post Hoc Registers (PH-Reg), an efficient self-distillation method that integrates registers into an existing ViT without requiring additional labeled data or full retraining. PH-Reg initializes both teacher and student networks from the same pre-trained ViT. The teacher remains frozen and unmodified, while the student is augmented with randomly initialized register tokens. By applying test-time augmentation to the teacher's inputs, we generate denoised dense embeddings free of artifacts, which are then used to optimize only a small subset of unlocked student weights. We show that our approach can effectively reduce the number of artifact tokens, improving the segmentation and depth prediction of the student ViT under zero-shot and linear probing.
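To illustrate why test-time augmentation on the teacher can yield denoised targets, the following is a minimal NumPy sketch under stated assumptions, not the PH-Reg implementation. `toy_teacher` is a hypothetical stand-in for a frozen ViT: it mean-pools each patch and adds a large spike at a fixed token position, mimicking an artifact that depends on token position rather than image content. `denoised_target`, `PATCH`, and the choice of augmentations (patch-aligned circular shifts and horizontal flips, enumerated exhaustively so the result is deterministic) are likewise assumptions made for illustration; the abstract does not specify which augmentations are used.

```python
import numpy as np

PATCH = 4  # hypothetical patch size for the toy example


def toy_teacher(image):
    """Hypothetical frozen teacher: patch-wise mean pooling plus an artifact.

    image: (H, W, C) array with H and W divisible by PATCH.
    Returns a dense feature map of shape (H // PATCH, W // PATCH, C).
    """
    h, w, c = image.shape
    feats = image.reshape(h // PATCH, PATCH, w // PATCH, PATCH, c).mean(axis=(1, 3))
    feats[0, 0] += 10.0  # artifact tied to token position, not to content
    return feats


def denoised_target(image, teacher):
    """Average teacher features over shifts and flips, aligned to the input frame."""
    gh, gw = image.shape[0] // PATCH, image.shape[1] // PATCH
    acc = np.zeros((gh, gw, image.shape[2]))
    n_aug = 0
    for dy in range(gh):
        for dx in range(gw):
            for flip in (False, True):
                # Augment the input: patch-aligned circular shift, optional flip.
                aug = np.roll(image, (dy * PATCH, dx * PATCH), axis=(0, 1))
                if flip:
                    aug = aug[:, ::-1]
                feats = teacher(aug)
                # Undo the augmentation on the feature map (reverse order).
                if flip:
                    feats = feats[:, ::-1]
                feats = np.roll(feats, (-dy, -dx), axis=(0, 1))
                acc += feats
                n_aug += 1
    return acc / n_aug


rng = np.random.default_rng(0)
image = rng.random((16, 16, 3))
clean = image.reshape(4, PATCH, 4, PATCH, 3).mean(axis=(1, 3))  # artifact-free reference

single_pass = toy_teacher(image)
target = denoised_target(image, toy_teacher)

print("max error, single pass:", np.abs(single_pass - clean).max())
print("max error, TTA average:", np.abs(target - clean).max())
```

In this toy setting the content-dependent features are exactly equivariant to the augmentations, so they survive the averaging unchanged, while the position-dependent spike lands on a different token of the original frame for each augmentation and is diluted from 10 at one token to a spatially uniform offset of about 0.625. A real ViT is only approximately equivariant, so the averaged output would be an approximation of artifact-free features rather than an exact recovery. In the setup described above, such averaged features would then serve as regression targets for the register-augmented student.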
