On Partial Prototype Collapse in the DINO Family of Self-Supervised Methods

Hariprasath Govindarajan, Per Sidén, Jacob Roll, Fredrik Lindsten

TL;DR

Encouraging the model to use diverse prototypes mitigates partial prototype collapse. Making effective use of the prototypes in turn lets the methods learn more fine-grained clusters, which encourages more informative representations.

Abstract

A prominent self-supervised learning paradigm is to model the representations as clusters, or more generally as a mixture model. Learning to map the data samples to compact representations while simultaneously fitting the mixture model leads to the representation collapse problem. Regularizing the distribution of data points over the clusters is the prevalent strategy to avoid this issue. While this is sufficient to prevent full representation collapse, we show that a partial prototype collapse problem still exists in the DINO family of methods, which leads to significant redundancies in the prototypes. Such prototype redundancies serve as shortcuts for the method to achieve a marginal latent class distribution that matches the prescribed prior. We show that by encouraging the model to use diverse prototypes, the partial prototype collapse can be mitigated. Effective utilization of the prototypes enables the methods to learn more fine-grained clusters, encouraging more informative representations. We demonstrate that this is especially beneficial when pre-training on a long-tailed fine-grained dataset.

Paper Structure

This paper contains 30 sections, 7 equations, 7 figures, 14 tables.

Figures (7)

  • Figure 1: (a) Without any regularization, the DINO family of methods results in a trivial full representation collapse. (b) Regularizing the marginal latent class distribution (MLCD), for example with centering and sharpening, prevents full representation collapse, but a partial prototype collapse still occurs. (c) KoLeo-data, proposed in DINOv2, spreads the data representations further apart but does not address the partial prototype collapse. Note that the method (both the baseline and with KoLeo-data) uses the partial prototype collapse to achieve an MLCD closer to a uniform distribution over all the prototypes, but the MLCD over only the unique prototypes is non-uniform. (d) We propose KoLeo-proto regularization, which explicitly encourages diverse prototypes and prevents partial prototype collapse.
  • Figure 2: ImageNet top-1 kNN accuracy with different MLCD regularizations. Probability centering performs better than SK and ME-MAX at different compute budgets.
  • Figure 3: (left) The number of unique prototypes is similar for the baseline and KoLeo-data regularization across different numbers of initialized prototypes. With KoLeo-proto, most of the initialized prototypes remain unique. This means that the hyperparameter $K$ can meaningfully control the number of learned clusters. (right) The number of initialized prototypes has no impact on the baseline performance. With any form of KoLeo regularization, more prototypes lead to better performance, and KoLeo-proto consistently performs best.
  • Figure 4: t-SNE plot of the $M$ unique prototypes learned by the baseline method and with KoLeo-proto regularization, colored by their redundancy factors $r_m$. There are fewer unique prototypes in the baseline ($M=1806$), noticeable from their sparse spread in the plot. The baseline prototypes are impacted by partial prototype collapse, resulting in high redundancy factors. With KoLeo-proto regularization, the model learns more unique prototypes ($M=7895$) with significantly smaller redundancy factors than the baseline, and these diverse prototypes are well spread over the latent space.
  • Figure 5: For the exact same set of images, the representations after the head (256-dimensional) are visualized using t-SNE plots. The points are colored based on the latent class that they belong to, and the corresponding prototypes are denoted using the $+$ marker (the prototype markers are slightly shifted to prevent them from blocking some smaller clusters). The images belong to 7 latent classes in the iBOT-vMF baseline, and the same images belong to 18 latent classes when the KoLeo-proto regularization is used. Partial prototype collapse in the baseline results in fewer unique prototypes and coarser clusters. KoLeo-proto regularization encourages diverse prototypes, which leads to a more fine-grained clustering of the same data.
  • ...and 2 more figures
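
The captions above describe KoLeo-proto as a regularizer that explicitly pushes prototypes apart. As a rough illustration only, the sketch below applies the Kozachenko-Leonenko (KoLeo) style penalty, the negative mean log distance from each L2-normalized vector to its nearest neighbour, to a small list of prototype vectors. The function names, the `eps` constant, and the plain-Python formulation are assumptions made for this example; the paper's actual implementation and loss weighting may differ.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def koleo_penalty(prototypes, eps=1e-8):
    """KoLeo-style penalty: -mean(log(nearest-neighbour distance)).

    Hypothetical helper for illustration. Lower values mean the
    L2-normalized prototypes are spread further apart; near-duplicate
    prototypes drive the penalty up sharply.
    """
    unit = [l2_normalize(p) for p in prototypes]
    total = 0.0
    for i, p in enumerate(unit):
        # Distance to the closest *other* prototype.
        nearest = min(math.dist(p, q) for j, q in enumerate(unit) if j != i)
        total += math.log(nearest + eps)
    return -total / len(unit)


# Four well-separated prototypes vs. a set where two have collapsed together.
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
collapsed = [[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [0.0, -1.0]]

print(koleo_penalty(spread) < koleo_penalty(collapsed))
```

In a training setup this quantity would presumably be computed on the prototype matrix with differentiable tensor operations and added to the main objective with some weight; that weighting is not specified here.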

Theorems & Definitions (1)

  • Definition 4.1: Partial prototype collapse
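
The figure captions refer to the number of unique prototypes $M$ and per-prototype redundancy factors $r_m$. One plausible way to measure these, sketched below under stated assumptions, is to greedily merge prototypes whose cosine similarity exceeds a threshold and count how many initialized prototypes fall into each group. The `threshold` value, the greedy grouping, and the reading of $r_m$ as a group size are assumptions for illustration and are not taken from Definition 4.1 itself.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def group_prototypes(prototypes, threshold=0.999):
    """Greedily merge near-duplicate prototypes (hypothetical helper).

    Returns (representatives, counts): one representative per group and
    the number of initialized prototypes assigned to that group.
    """
    representatives, counts = [], []
    for p in prototypes:
        for m, rep in enumerate(representatives):
            if cosine_similarity(p, rep) >= threshold:
                counts[m] += 1  # p duplicates an existing group
                break
        else:
            # No sufficiently similar representative: p starts a new group.
            representatives.append(p)
            counts.append(1)
    return representatives, counts


# K = 6 initialized prototypes, several pointing in the same direction.
prototypes = [
    [1.0, 0.0], [2.0, 0.0], [0.0, 1.0],
    [0.0, 3.0], [1.0, 0.00001], [-1.0, 0.0],
]
unique, redundancy = group_prototypes(prototypes)
print(len(unique), redundancy)
```

Under this reading, a partial prototype collapse shows up as far fewer groups than initialized prototypes, with some groups absorbing many duplicates.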