SINDER: Repairing the Singular Defects of DINOv2
Haoqi Wang, Tong Zhang, Mathieu Salzmann
TL;DR
This work identifies high-norm, input-agnostic defective tokens in pre-trained Vision Transformers as singular defects tied to the leading left singular vectors of the network weights. By linearizing the Attention and MLP blocks, the authors show that defect directions align with these singular vectors and can be predicted layer-wise from the weights, enabling a weight-based, input-agnostic repair. They propose SINDER, a data-efficient fine-tuning method that updates only the singular values, with a smooth regularization, on a small set of layers, avoiding full re-training. Across unsupervised segmentation, classification, supervised segmentation, and depth estimation, SINDER improves dense-prediction performance while preserving cls-token quality and requiring far fewer resources than re-training, offering a practical, scalable remedy for large SSL-trained transformers.
Abstract
Vision Transformer models trained on large-scale datasets, although effective, often exhibit artifacts in the patch tokens they extract. While such defects can be alleviated by re-training the entire model with additional classification tokens, the underlying reasons for the presence of these defective tokens remain unclear. In this paper, we conduct a thorough investigation of this phenomenon, combining theoretical analysis with empirical observations. Our findings reveal that these artifacts originate from the pre-trained network itself, specifically stemming from the leading left singular vector of the network's weights. Furthermore, to mitigate these defects, we propose a novel smooth regularization for fine-tuning that rectifies structural deficiencies using only a small dataset, thereby avoiding the need for complete re-training. We validate our method on various downstream tasks, including unsupervised segmentation, classification, supervised segmentation, and depth estimation, demonstrating its effectiveness in improving model performance. Code and checkpoints are available at https://github.com/haoqiwang/sinder.
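To make the central claim concrete, the following is a minimal NumPy sketch, not the authors' implementation. It uses a hypothetical toy weight matrix (random noise plus a strong rank-one component) in place of a linearized transformer block, and shows that outputs for arbitrary inputs lean toward the leading left singular vector. The final step, which lowers the leading singular value to the second one while keeping the singular vectors fixed, is only an illustrative stand-in for updating singular values; SINDER's actual regularization and layer selection are not reproduced here. All function and variable names are assumptions made for this example.

```python
import numpy as np

def leading_left_singular_vector(weight):
    """Return the leading left singular vector and singular value of `weight`."""
    u, s, _ = np.linalg.svd(weight, full_matrices=False)
    return u[:, 0], s[0]

def rebuild_with_singular_values(weight, new_singular_values):
    """Keep the singular vectors of `weight` and replace only its singular values."""
    u, _, vt = np.linalg.svd(weight, full_matrices=False)
    return (u * new_singular_values) @ vt

rng = np.random.default_rng(0)
dim = 64

# Hypothetical toy weight: small random matrix plus a dominant rank-one term a b^T.
a = rng.standard_normal(dim)
a /= np.linalg.norm(a)
b = rng.standard_normal(dim)
b /= np.linalg.norm(b)
weight = 0.1 * rng.standard_normal((dim, dim)) + 20.0 * np.outer(a, b)

u1, s1 = leading_left_singular_vector(weight)

# Push random inputs through the linear map and measure how strongly each
# output aligns with u1 (absolute cosine similarity, so the sign is ignored).
inputs = rng.standard_normal((dim, 1000))
outputs = weight @ inputs
cosines = np.abs(u1 @ outputs) / np.linalg.norm(outputs, axis=0)
mean_alignment = cosines.mean()

# Illustrative "repair": damp only the leading singular value (assumption:
# set it to the second-largest), leaving all singular vectors untouched.
singular_values = np.linalg.svd(weight, compute_uv=False)
damped = singular_values.copy()
damped[0] = singular_values[1]
repaired = rebuild_with_singular_values(weight, damped)

print("alignment of u1 with the planted direction:", abs(u1 @ a))
print("mean |cos(output, u1)| before repair:", mean_alignment)
print("top singular value before / after:", s1, np.linalg.svd(repaired, compute_uv=False)[0])
```

In this toy setting, the dominant direction of the outputs can be read off the weights alone, without looking at any particular input, which is the sense in which a weight-based repair is possible.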
