DINO as a von Mises-Fisher mixture model
Hariprasath Govindarajan, Per Sidén, Jacob Roll, Fredrik Lindsten
TL;DR
The paper reframes DINO as a mixture of von Mises-Fisher components on the unit hypersphere and introduces DINO-vMF, which adds a normalized vMF logit term to enable unnormalized prototypes while maintaining stability for large Vision Transformer backbones. By treating prototypes as vMF components with $\boldsymbol{\mu}^{(k)} = \boldsymbol{w}^{(k)}/\|\boldsymbol{w}^{(k)}\|$ and $\kappa^{(k)} = \|\boldsymbol{w}^{(k)}\|/\tau$, and by incorporating $\log C_p(\kappa^{(k)})$ into logits, the method gains flexibility and better utilization of the latent space. The authors propose an efficient large-$\nu$ approximation for the normalization constant and a probability-space centering scheme to avoid collapse, achieving stable training on ViT-Base and improved downstream performance across linear, kNN, few-shot, retrieval, and transfer tasks. Experiments show that DINO-vMF and iBOT-vMF generally outperform their baselines, with the largest gains for larger models, and reveal that prototype utilization and precision correlate with downstream ease. The work offers a principled link between SSL clustering objectives and probabilistic mixture modeling, suggesting further EM-like interpretations and extensions.
Abstract
Self-distillation methods using Siamese networks are popular for self-supervised pre-training. DINO is one such method based on a cross-entropy loss between $K$-dimensional probability vectors, obtained by applying a softmax function to the dot product between representations and learned prototypes. Given that the learned representations are $L^2$-normalized, we show that DINO and its derivatives, such as iBOT, can be interpreted as a mixture model of von Mises-Fisher components. With this interpretation, DINO assumes equal precision for all components when the prototypes are also $L^2$-normalized. Using this insight, we propose DINO-vMF, which adds appropriate normalization constants when computing the cluster assignment probabilities. Unlike DINO, DINO-vMF is also stable for the larger ViT-Base model with unnormalized prototypes. We show that the added flexibility of the mixture model is beneficial in terms of better image representations. The DINO-vMF pre-trained model consistently performs better than DINO on a range of downstream tasks. We obtain similar improvements for iBOT-vMF vs iBOT and thereby show that our proposed modification is also relevant for other methods derived from DINO.
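To make the role of the normalization constants concrete, the following is a minimal sketch of the cluster assignment probabilities under the vMF reading, using the parameterization from the TL;DR ($\kappa^{(k)} = \|\boldsymbol{w}^{(k)}\|/\tau$, logit $= \langle \boldsymbol{w}^{(k)}, \boldsymbol{y} \rangle/\tau + \log C_p(\kappa^{(k)})$). It relies on the standard vMF normalizer $C_p(\kappa) = \kappa^{p/2-1} / \big((2\pi)^{p/2} I_{p/2-1}(\kappa)\big)$. Several things here are assumptions for illustration and not the authors' implementation: the function names are hypothetical, the Bessel function is evaluated with a plain power series (suitable only for moderate $\kappa$) rather than the large-$\nu$ approximation the paper proposes, prototypes are assumed nonzero, and the centering step is omitted.

```python
import math


def log_bessel_i(nu, x, terms=500):
    """log of the modified Bessel function I_nu(x) for x > 0.

    Assumption: uses the power series
        I_nu(x) = sum_m (x/2)^(2m+nu) / (m! * Gamma(m+nu+1)),
    summed in log space. This is NOT the paper's large-nu approximation,
    and the fixed number of terms is only adequate for moderate x.
    """
    log_half_x = math.log(x / 2.0)
    log_terms = [
        (2 * m + nu) * log_half_x - math.lgamma(m + 1) - math.lgamma(m + nu + 1)
        for m in range(terms)
    ]
    peak = max(log_terms)
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))


def log_vmf_norm(p, kappa):
    """log C_p(kappa) for a vMF distribution on the unit sphere in R^p."""
    nu = p / 2.0 - 1.0
    return (
        nu * math.log(kappa)
        - (p / 2.0) * math.log(2.0 * math.pi)
        - log_bessel_i(nu, kappa)
    )


def vmf_assignment_probs(y, prototypes, tau):
    """Softmax over <w_k, y>/tau + log C_p(||w_k||/tau).

    y is assumed to be L2-normalized; prototypes may have any nonzero norm.
    Centering (used in the paper to avoid collapse) is left out of this sketch.
    """
    p = len(y)
    logits = []
    for w in prototypes:
        kappa = math.sqrt(sum(c * c for c in w)) / tau
        dot = sum(a * b for a, b in zip(w, y))
        logits.append(dot / tau + log_vmf_norm(p, kappa))
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    y = [1.0, 0.0, 0.0]
    prototypes = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]  # unequal norms
    print(vmf_assignment_probs(y, prototypes, tau=0.5))
```

When all prototypes share the same norm, the $\log C_p$ terms are identical and cancel in the softmax, which recovers the plain DINO probabilities; the extra term only matters once prototype norms differ.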
