The Hidden Uniform Cluster Prior in Self-Supervised Learning

Mahmoud Assran, Randall Balestriero, Quentin Duval, Florian Bordes, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Nicolas Ballas

TL;DR

The paper shows that common self-supervised joint-embedding losses enforce a uniform cluster prior, which harms learning on class-imbalanced data. It formalizes this prior as a K-means-like clustering bias, in both implicit and explicit form, and demonstrates its negative impact through experiments on toy and real data. To address it, the paper introduces Prior Matching for Siamese Networks (PMSN), which extends MSN to arbitrary priors (notably power-law), and reports improved semantic transfer on long-tailed datasets such as iNaturalist when the prior is matched to the data distribution. Visualizations of the learned prototypes illustrate how the choice of prior shapes the semantic content of the learned representations.

Abstract

A successful paradigm in representation learning is to perform self-supervised pretraining using tasks based on mini-batch statistics (e.g., SimCLR, VICReg, SwAV, MSN). We show that the formulation of all these methods contains an overlooked prior: to learn features that enable uniform clustering of the data. While this prior has led to remarkably semantic representations when pretraining on class-balanced data, such as ImageNet, we demonstrate that it can hamper performance when pretraining on class-imbalanced data. By moving away from conventional uniformity priors and instead preferring power-law distributed feature clusters, we show that one can improve the quality of the learned representations on real-world class-imbalanced datasets. To demonstrate this, we develop an extension of the Masked Siamese Networks (MSN) method to support the use of arbitrary feature priors.
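
As a rough illustration of what "matching a prior" over feature clusters could look like, the sketch below computes a KL penalty between the batch-averaged soft cluster assignment and a target prior. This is a minimal sketch under stated assumptions, not the paper's implementation: the KL form of the penalty, the helper names, the exponent name `tau`, and its default value are all hypothetical. The one identity it checks is standard: with a uniform prior over K clusters, the KL term equals log K minus the entropy of the mean prediction, so minimizing it is the same as maximizing mean entropy.

```python
import numpy as np

def power_law_prior(num_clusters, tau=0.25):
    # Hypothetical target prior: p_k proportional to k^(-tau), k = 1..K.
    ranks = np.arange(1, num_clusters + 1, dtype=float)
    weights = ranks ** (-tau)
    return weights / weights.sum()

def entropy(p, eps=1e-12):
    # Shannon entropy (in nats) of a probability vector.
    return float(-np.sum(p * np.log(p + eps)))

def kl_to_prior(probs, prior, eps=1e-12):
    # probs: (batch, K) soft cluster assignments, each row sums to 1.
    # Penalty (assumed form): KL(mean prediction || prior).
    mean_pred = probs.mean(axis=0)
    return float(np.sum(mean_pred * (np.log(mean_pred + eps) - np.log(prior + eps))))

# Toy batch of soft assignments obtained from random logits.
rng = np.random.default_rng(0)
logits = rng.normal(size=(256, 8))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

uniform = np.full(8, 1.0 / 8)
kl_uniform = kl_to_prior(probs, uniform)
kl_power_law = kl_to_prior(probs, power_law_prior(8))

# With a uniform prior the penalty reduces to log K - H(mean prediction).
gap = kl_uniform - (np.log(8) - entropy(probs.mean(axis=0)))
print(kl_uniform, kl_power_law, gap)
```

Swapping `uniform` for `power_law_prior(8)` changes only the target of the penalty; the rest of a training objective would be unaffected in this sketch.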

Paper Structure (44 sections, 4 theorems, 29 equations, 5 figures, 5 tables)


Key Result

Proposition 1

The explicit K-means problem, defined by learning a set of $K$ centroids $\mu_1,\dots,\mu_{K}$, admits the same global optimum as the implicit K-means problem, defined by learning a cluster membership matrix ${\bm{P}} \in \{0,1\}^{N \times K}$, such that ${\bm{P}}\mathbf{1}_{K}=\mathbf{1}_{N}$.
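
The intuition behind Proposition 1 can be checked numerically. The sketch below (helper names are illustrative, not from the paper) evaluates the explicit objective over centroids and the implicit objective over a one-hot membership matrix, and verifies the two inequalities that tie them together: replacing centroids by the membership matrix they induce never increases the loss, and replacing a membership matrix by its cluster means never increases the loss either. Together these imply that neither problem can have a strictly better global optimum than the other.

```python
import numpy as np

def squared_distances(X, centroids):
    # (N, K) matrix of squared Euclidean distances.
    return ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)

def explicit_loss(X, centroids):
    # Explicit K-means: each point pays the distance to its nearest centroid.
    return float(squared_distances(X, centroids).min(axis=1).sum())

def nearest_membership(X, centroids):
    # One-hot membership matrix P in {0,1}^{N x K} with P @ 1_K = 1_N.
    d2 = squared_distances(X, centroids)
    P = np.zeros_like(d2)
    P[np.arange(len(X)), d2.argmin(axis=1)] = 1.0
    return P

def cluster_means(X, P):
    # For a fixed P the best centroids are the cluster means.
    counts = np.maximum(P.sum(axis=0), 1.0)  # guard against empty clusters
    return (P.T @ X) / counts[:, None]

def implicit_loss(X, P):
    # Implicit K-means: the loss depends on the membership matrix only.
    return float(((X - P @ cluster_means(X, P)) ** 2).sum())

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
centroids = X[rng.choice(200, size=3, replace=False)]
P = nearest_membership(X, centroids)

# Centroids -> induced membership: the loss cannot go up.
step_down_1 = implicit_loss(X, P) <= explicit_loss(X, centroids) + 1e-9
# Membership -> its cluster means: the loss cannot go up.
step_down_2 = explicit_loss(X, cluster_means(X, P)) <= implicit_loss(X, P) + 1e-9
print(step_down_1, step_down_2)
```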

Figures (5)

  • Figure 1: Impact of the uniform cluster prior in K-means when the class distribution of the data is imbalanced. K-means clustering depicted in color (green vs red). Ground-truth cluster separation depicted with a dotted black line. When the uniform feature prior is not satisfied, K-means can identify undesirable features for discriminating between data points.
  • Figure 2: Visualization of prototypes learned by an MSN model pretrained on ImageNet-1K with either class-balanced or class-imbalanced mini-batch distributions. We use RCDM (Bordes et al., 2022) to enable visualization of the prototypes (details in the appendix on RCDM). Each row corresponds to samples generated by conditioning on a prototype with various random seeds. Features that remain constant across the row depict information contained in the prototypes, whereas features that vary depict information that is not contained. a) When pretraining with class-balanced mini-batches, the emergent features tend to be associated with high-level concepts, such as specific ImageNet classes. b) By contrast, when pretraining with class-imbalanced mini-batches, the learned features tend to be associated with low-level concepts, such as shape, pose, or texture.
  • Figure 3: Each row visualizes the nearest neighbours of a reference image (first column) in the embedding space of a PMSN model pretrained on grayscale images with an MNIST digit in the top-left corner. The distribution of MNIST digits in the dataset is constructed to follow a long-tailed power-law distribution. b) When pretraining with a power-law prior, the MNIST digit is encoded by the model, and the nearest neighbours all have the same digit class. c) When pretraining with a uniform prior, the MNIST digit information is discarded by the model, and therefore the nearest neighbours have different digit classes.
  • Figure 4: Visualization of prototypes learned by a PMSN model pretrained with either uniform or power-law priors on the iNat18 dataset. We use RCDM (Bordes et al., 2022) to enable visualization of the prototypes. Each row corresponds to samples generated by conditioning on a prototype with various random seeds. Features that remain constant across the row depict information contained in the prototypes, whereas features that vary depict information that is not contained. a) Features that emerge when pretraining with a power-law prior on the iNat18 dataset tend to be associated with high-level concepts such as specific image classes. b) Features that emerge when pretraining with a uniform prior are largely associated with low-level concepts such as texture.
  • Figure 5: Visual representation of the results in the class-stratified sampling table. Methods relying on volume maximization regularizers all exhibit similar changes in performance across diverse transfer tasks.
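
The effect described in the Figure 1 caption can be reproduced in a toy setting. The sketch below is an assumption-laden illustration, not the paper's experiment: the synthetic data, the class sizes (2000 vs 20), and the deterministic initialization at the leftmost and rightmost points are all choices made here for reproducibility. It runs plain Lloyd iterations with K = 2 and compares the K-means objective of the resulting partition with that of the ground-truth partition.

```python
import numpy as np

rng = np.random.default_rng(0)

# Majority class: 2000 points spread widely along x, tight in y.
majority = np.column_stack([rng.uniform(-10, 10, 2000), rng.normal(0, 0.1, 2000)])
# Minority class: 20 points, separated from the majority along y.
minority = np.column_stack([rng.normal(0, 0.1, 20), rng.normal(3, 0.1, 20)])
X = np.vstack([majority, minority])
y_true = np.concatenate([np.zeros(2000, dtype=int), np.ones(20, dtype=int)])

def lloyd(X, centroids, n_iter=50):
    # Plain Lloyd iterations: assign to nearest centroid, then recompute means.
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        centroids = np.stack([X[labels == k].mean(axis=0)
                              for k in range(len(centroids))])
    return labels, centroids

def partition_loss(X, labels):
    # K-means objective of a partition: squared distance to each cluster mean.
    return float(sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
                     for k in np.unique(labels)))

# Deterministic init (an assumption of this sketch): leftmost and rightmost points.
init = X[[X[:, 0].argmin(), X[:, 0].argmax()]]
labels, _ = lloyd(X, init)

cluster_sizes = np.bincount(labels, minlength=2)
kmeans_loss = partition_loss(X, labels)
true_loss = partition_loss(X, y_true)
print(cluster_sizes, kmeans_loss, true_loss)
```

In this construction the ground-truth partition has a higher K-means objective than a roughly even split of the majority class along x, so K-means prefers the split that ignores the class-defining feature.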

Theorems & Definitions (4)

  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Proposition 4