DINO as a von Mises-Fisher mixture model

Hariprasath Govindarajan; Per Sidén; Jacob Roll; Fredrik Lindsten

TL;DR

The paper reframes DINO as a mixture of von Mises-Fisher (vMF) components on the unit hypersphere and introduces DINO-vMF, which adds the vMF log-normalization constant to the logits. This allows unnormalized prototypes while keeping training stable for larger Vision Transformer backbones. Each prototype $\boldsymbol{w}^{(k)}$ is treated as a vMF component with mean direction $\boldsymbol{\mu}^{(k)} = \boldsymbol{w}^{(k)}/\|\boldsymbol{w}^{(k)}\|$ and precision $\kappa^{(k)} = \|\boldsymbol{w}^{(k)}\|/\tau$, and $\log C_p(\kappa^{(k)})$ is incorporated into the logits, which gives the model more flexibility and better utilization of the latent space. The authors propose an efficient large-$\nu$ approximation for the normalization constant and a probability-space centering scheme to avoid collapse, achieving stable training on ViT-Base and improved downstream performance across linear, kNN, few-shot, retrieval, and transfer tasks. Experiments show that DINO-vMF and iBOT-vMF generally outperform their baselines, with the largest gains for larger models, and indicate that prototype utilization and component precision correlate with how easily the associated data are handled downstream. The work offers a principled link between SSL clustering objectives and probabilistic mixture modeling, suggesting further EM-like interpretations and extensions.
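The vMF view summarized above can be illustrated with a minimal sketch. It assumes the standard vMF normalizer $\log C_p(\kappa) = (p/2 - 1)\log\kappa - (p/2)\log(2\pi) - \log I_{p/2-1}(\kappa)$ and approximates $\log I_\nu(\kappa)$ with the leading term of the textbook uniform asymptotic expansion for large $\nu$. This may differ in detail from the approximation used in the paper. The function names, the dimension `p = 256`, and the temperature `tau = 0.1` are illustrative assumptions, not values taken from the authors' code.

```python
import math

def log_bessel_i_large_nu(nu, kappa):
    """Leading-order uniform asymptotic approximation of log I_nu(kappa).

    Intended for large nu; an assumption here, not necessarily the
    exact approximation used in the paper.
    """
    z = kappa / nu
    s = math.sqrt(1.0 + z * z)
    eta = s + math.log(z / (1.0 + s))
    return nu * eta - 0.5 * math.log(2.0 * math.pi * nu) - 0.5 * math.log(s)

def log_vmf_normalizer(kappa, p):
    """Approximate log C_p(kappa) for a vMF density on the unit sphere in R^p."""
    nu = p / 2.0 - 1.0
    return (nu * math.log(kappa)
            - (p / 2.0) * math.log(2.0 * math.pi)
            - log_bessel_i_large_nu(nu, kappa))

def assignment_probs(y, prototypes, tau, use_vmf=True):
    """Softmax over prototype logits for one unit-norm representation y.

    With use_vmf=True, log C_p(kappa_k) is added to each logit, where
    kappa_k = ||w_k|| / tau (the vMF-normalized variant). With
    use_vmf=False, this is the plain softmax over <w_k, y> / tau.
    """
    p = len(y)
    logits = []
    for w in prototypes:
        logit = sum(wi * yi for wi, yi in zip(w, y)) / tau
        if use_vmf:
            kappa = math.sqrt(sum(wi * wi for wi in w)) / tau
            logit += log_vmf_normalizer(kappa, p)
        logits.append(logit)
    # Numerically stable softmax.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy setup: p = 256, two unnormalized prototypes.
p = 256
y = [1.0] + [0.0] * (p - 1)
prototypes = [
    [1.0] + [0.0] * (p - 1),       # norm 1, aligned with y
    [0.0, 3.0] + [0.0] * (p - 2),  # norm 3, orthogonal to y
]
probs_plain = assignment_probs(y, prototypes, tau=0.1, use_vmf=False)
probs_vmf = assignment_probs(y, prototypes, tau=0.1, use_vmf=True)
```

When all prototypes share the same norm, the added term is the same constant for every logit and cancels in the softmax, which matches the abstract's observation that $L^2$-normalized prototypes correspond to equal precision for all components. With differing norms, the normalizer changes the relative weight of each component.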

Abstract

Self-distillation methods using Siamese networks are popular for self-supervised pre-training. DINO is one such method based on a cross-entropy loss between $K$-dimensional probability vectors, obtained by applying a softmax function to the dot product between representations and learnt prototypes. Given the fact that the learned representations are $L^2$-normalized, we show that DINO and its derivatives, such as iBOT, can be interpreted as a mixture model of von Mises-Fisher components. With this interpretation, DINO assumes equal precision for all components when the prototypes are also $L^2$-normalized. Using this insight we propose DINO-vMF, that adds appropriate normalization constants when computing the cluster assignment probabilities. Unlike DINO, DINO-vMF is stable also for the larger ViT-Base model with unnormalized prototypes. We show that the added flexibility of the mixture model is beneficial in terms of better image representations. The DINO-vMF pre-trained model consistently performs better than DINO on a range of downstream tasks. We obtain similar improvements for iBOT-vMF vs iBOT and thereby show the relevance of our proposed modification also for other methods derived from DINO.

Paper Structure (28 sections, 13 equations, 11 figures, 15 tables)

Introduction
DINO
Method
DINO: A closer look at the final layer
DINO as a Mixture Model
Normalizing von Mises-Fisher (vMF) components
Avoiding Collapse
Related Work
Experiments
ImageNet Classification
Ablation studies
ImageNet classification with full dataset
ImageNet few-shot evaluation
Analysis of learned vMF mixture model
Transfer learning
...and 13 more sections

Figures (11)

Figure 1: Overview of DINO. (a): High-level architecture of DINO; (b): A closer look at the networks $g_{\theta}$, modeled as a combination of a backbone $f_{\phi}$ and a prediction head $h_{\psi, W}$, where $\theta = \{\phi, \psi, W\}$. The prediction head contains 3 MLP layers, an $L^2$-normalization bottleneck and a weight-normalized linear layer. The weights of the weight-normalized linear layer are $L^2$-normalized in the larger ViT-Base models to ensure stable training.
Figure 1: ImageNet kNN classification accuracy ablating on the impact of $L^2$-normalization of prototypes, vMF normalization and probability centering. Averages over 2 runs are reported. Refer to the ablation studies appendix for results of individual runs.
Figure 2: kNN accuracy for data sorted based on percentile ranges of associated $\| \boldsymbol{w}^{(k)} \|$.
Figure 3: von Mises-Fisher density on the circle for a prototype vector pointing in the direction of $315^{\circ}$ for two different values of prototype magnitudes (larger magnitude: blue curve, smaller magnitude: red curve).
Figure 4: Our approximation up to a constant, $\log C^{(a)}_p(\kappa)$.
...and 6 more figures
