Clustering Properties of Self-Supervised Learning

Xi Weng, Jianing An, Xudong Ma, Binhang Qi, Jie Luo, Xi Yang, Jin Song Dong, Lei Huang

TL;DR

This work investigates the clustering properties of self-supervised representations learned by joint embedding architectures and identifies the encodings $H$ as the most robust clustering substrate. It introduces Representation Self-Assignment (ReSA), a positive-feedback SSL method that uses online self-clustering of $H$ via the Sinkhorn-Knopp algorithm to guide the embedding learning objective, yielding improved clustering and retrieval of semantic structure. Across extensive benchmarks, ReSA outperforms state-of-the-art SSL methods on small- to large-scale datasets, including ImageNet, and demonstrates strong transfer to downstream tasks such as COCO detection and instance segmentation. The approach enhances both fine-grained and coarse-grained clustering, suggesting that encoding-driven clustering can be a powerful driver for scalable, semantically meaningful representations in visual learning.
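To make the self-clustering step concrete, the following is a minimal NumPy sketch of Sinkhorn-Knopp normalization applied to the cosine similarities of a batch of encodings. It is not the authors' implementation: the function names, the temperature value, and the iteration count are all assumptions chosen for illustration, and the summary above does not specify how the resulting assignment matrix enters the loss.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm so dot products are cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def sinkhorn_knopp(scores, temperature=0.1, n_iters=3):
    """Alternately normalize rows and columns of exp(scores / temperature).

    temperature and n_iters are illustrative assumptions, not values
    taken from the paper.
    """
    # Subtract the maximum before exponentiating for numerical stability.
    q = np.exp((scores - scores.max()) / temperature)
    for _ in range(n_iters):
        q /= q.sum(axis=1, keepdims=True)  # each row sums to 1
        q /= q.sum(axis=0, keepdims=True)  # each column sums to 1
    return q


def self_assignment(encodings, temperature=0.1, n_iters=3):
    """Hypothetical helper: batch encodings -> soft assignment matrix."""
    h = l2_normalize(encodings)
    similarities = h @ h.T  # pairwise cosine similarities within the batch
    return sinkhorn_knopp(similarities, temperature, n_iters)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch = rng.normal(size=(8, 16))  # 8 samples with 16-dimensional encodings
    assignment = self_assignment(batch)
    print(assignment.shape)
    print(assignment.sum(axis=0))
```

Because the loop ends on a column normalization, the columns sum to one exactly while the rows only approximately do; more iterations bring the matrix closer to doubly stochastic. In a training setting such targets would typically be computed without gradient propagation.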

Abstract

Self-supervised learning (SSL) methods via joint embedding architectures have proven remarkably effective at capturing semantically rich representations with strong clustering properties, even in the absence of label supervision. Despite this, few have explored leveraging these untapped properties to improve themselves. In this paper, we provide evidence through various metrics that the encoder's output, the encoding, exhibits superior and more stable clustering properties compared to other components. Building on this insight, we propose a novel positive-feedback SSL method, termed Representation Self-Assignment (ReSA), which leverages the model's clustering properties to promote learning in a self-guided manner. Extensive experiments on standard SSL benchmarks reveal that models pretrained with ReSA outperform other state-of-the-art SSL methods by a significant margin. Finally, we analyze how ReSA facilitates better clustering properties, demonstrating that it effectively enhances clustering performance at both fine-grained and coarse-grained levels, shaping representations that are inherently more structured and semantically meaningful.

Paper Structure

This paper contains 41 sections, 11 equations, 13 figures, 12 tables, 1 algorithm.

Figures (13)

  • Figure 1: The positive-feedback SSL framework. The model generates representations that carry semantic clustering information. This clustering information is used to design a self-supervised loss function, which in turn guides the model's learning process more effectively.
  • Figure 2: The basic notations for joint embedding architectures (JEA) in SSL.
  • Figure 3: Comparison of clustering metrics in encoding $\mathbf{H}$ and embedding $\mathbf{Z}$ across various self-supervised pretrained models. All methods utilize a ResNet-18 encoder pretrained on CIFAR-10 for 1000 epochs. Circular markers represent metrics computed using encodings, while cross markers correspond to metrics derived from embeddings. All metrics are computed on the entire training set, and similar trends can be observed in the validation set.
  • Figure 4: Comparison of linear evaluation accuracy and clustering metrics of encoding $\mathbf{H}$, embedding $\mathbf{Z}$, and the hidden layer outputs within the projector $\mathbf{P}$ during the training process. The experiments are conducted using SimCLR, VICReg, and SwAV, employing a ResNet-18 encoder pretrained on CIFAR-100 for 500 epochs. The projector is a standard three-layer MLP with BN and ReLU activations, containing two hidden linear layers, so their outputs are denoted as $\mathbf{P}_{0}$ and $\mathbf{P}_{1}$.
  • Figure 5: The framework of Representation Self-Assignment (ReSA). Here, no grad. denotes that the operation does not involve gradient propagation, norm signifies that each sample is $L_2$-normalized to compute cosine similarities, and sinkhorn refers to the Sinkhorn-Knopp algorithm used for clustering assignment.
  • ...and 8 more figures

Theorems & Definitions (2)

  • Definition 1.1
  • Definition 1.2