
SINDER: Repairing the Singular Defects of DINOv2

Haoqi Wang, Tong Zhang, Mathieu Salzmann

TL;DR

This work characterizes the high-norm, input-agnostic defective tokens in pre-trained Vision Transformers as singular defects tied to the leading left singular vectors of the network weights. By linearizing the Attention and MLP blocks, the authors show that the defect directions align with these singular vectors and can be predicted layer by layer, which enables a weight-based, input-agnostic repair. They propose SINDER, a data-efficient fine-tuning method that updates only the singular values of a small set of layers under a smooth regularization, avoiding full retraining. Across unsupervised segmentation, classification, supervised segmentation, and depth estimation, SINDER improves dense-prediction performance while preserving cls-token quality and requiring far fewer resources than re-training, which makes it a practical remedy for large SSL-trained transformers.

Abstract

Vision Transformer models trained on large-scale datasets, although effective, often exhibit artifacts in the patch tokens they extract. While such defects can be alleviated by re-training the entire model with additional classification tokens, the underlying reasons for the presence of these defective tokens remain unclear. In this paper, we conduct a thorough investigation of this phenomenon, combining theoretical analysis with empirical observations. Our findings reveal that these artifacts originate from the pre-trained network itself, specifically from the leading left singular vector of the network's weights. Furthermore, to mitigate these defects, we propose a novel smooth regularization for fine-tuning that rectifies structural deficiencies using only a small dataset, thereby avoiding the need for complete re-training. We validate our method on various downstream tasks, including unsupervised segmentation, classification, supervised segmentation, and depth estimation, demonstrating its effectiveness in improving model performance. Code and checkpoints are available at https://github.com/haoqiwang/sinder.
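
The central claim, that high-norm tokens point along the leading left singular vector of the weights, can be illustrated on a toy linear map. The following is a minimal NumPy sketch, not the authors' code: the matrix `weight`, the planted direction, the random tokens, and the helper names are all hypothetical, and a single matrix stands in for the linearized Attention and MLP blocks analyzed in the paper.

```python
import numpy as np


def leading_left_singular_vector(weight):
    """Return the unit left singular vector with the largest singular value."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    return u[:, 0]  # singular values are returned in descending order


def acute_angle_deg(a, b):
    """Acute angle in degrees between the lines spanned by a and b."""
    cos = abs(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(np.degrees(np.arccos(np.clip(cos, 0.0, 1.0))))


rng = np.random.default_rng(0)
dim = 64

# Toy weight (assumption): small random matrix plus a dominant rank-one term.
planted = rng.normal(size=dim)
planted /= np.linalg.norm(planted)
right = rng.normal(size=dim)
right /= np.linalg.norm(right)
weight = 0.05 * rng.normal(size=(dim, dim)) + 20.0 * np.outer(planted, right)

u1 = leading_left_singular_vector(weight)

# Push hypothetical tokens through y = W x and pick the highest-norm output.
tokens = rng.normal(size=(256, dim))
outputs = tokens @ weight.T
norms = np.linalg.norm(outputs, axis=1)
highest = outputs[np.argmax(norms)]

# The highest-norm output lies close to the leading left singular vector.
angle = acute_angle_deg(highest, u1)
print(f"angle between highest-norm output and u1: {angle:.2f} degrees")
```

Because the direction is computed from `weight` alone, it does not depend on which tokens are fed in; this is the sense in which the defect is input-agnostic in this toy setting.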

Paper Structure (40 sections, 8 equations, 7 figures, 8 tables, 1 algorithm)

Figures (7)

  • Figure 1: Visualization of singular defects in the feature map of the last layer of DINOv2. The images are resized to have height $896$ when input into the networks. The color of the PCA visualization comes from the three principal components of the patch tokens.
  • Figure 2: Angle between theoretical and empirical defect directions. The blue lines show the angle between the empirical defect direction and the leading left singular vector of $I+A$, $I+C$, $E$, and $G$, respectively. Angles between the leading left singular vectors and all the patch tokens in each layer are shown as violin plots. The $x$-axis is the layer index, and the $y$-axis is the acute angle in degrees. The villa image in Figure \ref{fig:high_norm} is used.
  • Figure 3: Visualization of unsupervised segmentation on Cityscapes using STEGO.
  • Figure 4: Visualization after clamping the singular values of linear layers. Results are shown for two images. The first and third columns are the PCA visualization of the feature map in the last layer. The second and fourth columns are violin plots of the norms of the corresponding tokens.
  • Figure 5: The violin plot shows the angles between the theoretical singular defect direction $\nu_i$ and the patch tokens. The first row below the violin plot is the PCA visualization of the patch tokens in the 9th, 19th, 29th, and 39th layers. The second row is a heat map of the angle between $\nu_i$ and the patch tokens; the darker the color, the smaller the angle. The third row shows the defective tokens detected by the logits defined in Equation (5). The last three rows show the learning target under the temperature hyper-parameter $\tau=0.01, 0.1, 1$. We use $\tau=0.1$ in our experiments.
  • ...and 2 more figures
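
Two operations mentioned in the captions above, coloring patch tokens by their three principal components and clamping the singular values of a linear layer, can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the function names, the threshold `max_value`, the min-max color normalization, and the random stand-in tokens are all hypothetical.

```python
import numpy as np


def pca_rgb(tokens):
    """Map (n_tokens, dim) features to (n_tokens, 3) colors in [0, 1].

    Projects centered tokens onto their top three principal components,
    then rescales each component independently (normalization is an assumption).
    """
    centered = tokens - tokens.mean(axis=0, keepdims=True)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[:3].T
    lo = proj.min(axis=0)
    hi = proj.max(axis=0)
    return (proj - lo) / (hi - lo + 1e-12)


def clamp_singular_values(weight, max_value):
    """Rebuild weight with every singular value capped at max_value."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    return (u * np.minimum(s, max_value)) @ vt


rng = np.random.default_rng(0)

# Hypothetical 16x16 grid of 32-dimensional patch tokens.
tokens = rng.normal(size=(16 * 16, 32))
rgb_image = pca_rgb(tokens).reshape(16, 16, 3)

# Hypothetical linear-layer weight and clamping threshold.
weight = rng.normal(size=(32, 32))
clamped = clamp_singular_values(weight, max_value=5.0)
print(np.linalg.norm(weight, 2), np.linalg.norm(clamped, 2))
```

Clamping leaves the singular vectors untouched and only shrinks the largest singular values, so the spectral norm of `clamped` is at most `max_value`.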