DINO as a von Mises-Fisher mixture model

Hariprasath Govindarajan, Per Sidén, Jacob Roll, Fredrik Lindsten

TL;DR

The paper reframes DINO as a mixture of von Mises-Fisher components on the unit hypersphere and introduces DINO-vMF, which adds a normalized vMF logit term to enable unnormalized prototypes while maintaining stability for large Vision Transformer backbones. By treating prototypes as vMF components with $\boldsymbol{\mu}^{(k)} = \boldsymbol{w}^{(k)}/\|\boldsymbol{w}^{(k)}\|$ and $\kappa^{(k)} = \|\boldsymbol{w}^{(k)}\|/\tau$, and by incorporating $\log C_p(\kappa^{(k)})$ into logits, the method gains flexibility and better utilization of the latent space. The authors propose an efficient large-$\nu$ approximation for the normalization constant and a probability-space centering scheme to avoid collapse, achieving stable training on ViT-Base and improved downstream performance across linear, kNN, few-shot, retrieval, and transfer tasks. Experiments show that DINO-vMF and iBOT-vMF generally outperform their baselines, with the largest gains for larger models, and reveal that prototype utilization and precision correlate with downstream ease. The work offers a principled link between SSL clustering objectives and probabilistic mixture modeling, suggesting further EM-like interpretations and extensions.

Abstract

Self-distillation methods using Siamese networks are popular for self-supervised pre-training. DINO is one such method based on a cross-entropy loss between $K$-dimensional probability vectors, obtained by applying a softmax function to the dot product between representations and learnt prototypes. Given the fact that the learned representations are $L^2$-normalized, we show that DINO and its derivatives, such as iBOT, can be interpreted as a mixture model of von Mises-Fisher components. With this interpretation, DINO assumes equal precision for all components when the prototypes are also $L^2$-normalized. Using this insight we propose DINO-vMF, that adds appropriate normalization constants when computing the cluster assignment probabilities. Unlike DINO, DINO-vMF is stable also for the larger ViT-Base model with unnormalized prototypes. We show that the added flexibility of the mixture model is beneficial in terms of better image representations. The DINO-vMF pre-trained model consistently performs better than DINO on a range of downstream tasks. We obtain similar improvements for iBOT-vMF vs iBOT and thereby show the relevance of our proposed modification also for other methods derived from DINO.
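
The cluster assignment described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, equal mixture weights are assumed, centering and sharpening are left out, and $\log C_p(\kappa)$ is evaluated with a plain power series for the modified Bessel function rather than the large-$\nu$ approximation the paper proposes. It uses the standard vMF normalization constant $C_p(\kappa) = \kappa^{p/2-1} / \big((2\pi)^{p/2} I_{p/2-1}(\kappa)\big)$.

```python
import math


def log_bessel_i(nu, kappa, terms=200):
    """log I_nu(kappa) from the power series, summed in log-space.

    Assumes kappa > 0; 200 terms is adequate for moderate kappa only.
    """
    log_terms = [
        (2 * m + nu) * math.log(kappa / 2.0)
        - math.lgamma(m + 1.0)
        - math.lgamma(m + nu + 1.0)
        for m in range(terms)
    ]
    top = max(log_terms)
    return top + math.log(sum(math.exp(t - top) for t in log_terms))


def log_vmf_const(p, kappa):
    """log C_p(kappa), the log-normalization constant of a vMF density in R^p."""
    nu = p / 2.0 - 1.0
    return (
        nu * math.log(kappa)
        - (p / 2.0) * math.log(2.0 * math.pi)
        - log_bessel_i(nu, kappa)
    )


def vmf_assignment_probs(y, prototypes, tau):
    """Softmax over <w_k, y> / tau + log C_p(kappa_k), with kappa_k = ||w_k|| / tau.

    y is assumed to be L2-normalized; prototypes may have any non-zero norm.
    Equal mixture weights are an assumption of this sketch.
    """
    p = len(y)
    logits = []
    for w in prototypes:
        kappa = math.sqrt(sum(wi * wi for wi in w)) / tau
        dot = sum(wi * yi for wi, yi in zip(w, y))
        logits.append(dot / tau + log_vmf_const(p, kappa))
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    y = [0.6, 0.8, 0.0]  # unit-norm representation
    unit_prototypes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    scaled_prototypes = [[3.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    print(vmf_assignment_probs(y, unit_prototypes, tau=0.5))
    print(vmf_assignment_probs(y, scaled_prototypes, tau=0.5))
```

When every prototype has unit norm, all $\kappa^{(k)}$ coincide, the $\log C_p$ terms cancel in the softmax, and the result reduces to the plain softmax over scaled dot products. This is the equal-precision case the abstract attributes to DINO with $L^2$-normalized prototypes.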

Paper Structure (28 sections, 13 equations, 11 figures, 15 tables)

Figures (11)

  • Figure 1: Overview of DINO. (a): High-level architecture of DINO; (b): A closer look at the networks $g_{\theta}$, modeled as a combination of a backbone $f_{\phi}$ and a prediction head $h_{\psi, W}$, where $\theta = \{\phi, \psi, W\}$. The prediction head contains 3 MLP layers, an $L^2$-normalization bottleneck and a weight-normalized linear layer. The weights of the weight-normalized linear layer are $L^2$-normalized in the larger ViT-Base models to ensure stable training.
  • Figure 1: ImageNet kNN classification accuracy ablating on the impact of $L^2$-normalization of prototypes, vMF normalization and probability centering. Averages over 2 runs are reported. Refer to the appendix on ablation studies for results of individual runs.
  • Figure 2: kNN accuracy for data sorted based on percentile ranges of associated $\| \boldsymbol{w}^{(k)} \|$.
  • Figure 3: von Mises-Fisher density on the circle for a prototype vector pointing in the direction of $315^{\circ}$ for two different values of prototype magnitudes (larger magnitude: blue curve, smaller magnitude: red curve).
  • Figure 4: Our approximation up to a constant, $\log C^{(a)}_p(\kappa)$.
  • ...and 6 more figures