Efficient Unsupervised Visual Representation Learning with Explicit Cluster Balancing

Ioannis Maniadis Metaxas; Georgios Tzimiropoulos; Ioannis Patras

TL;DR

ExCB tackles collapse in clustering-based self-supervised visual learning by introducing an online cluster-balancing mechanism that measures cluster sizes across batches using hard assignments and adjusts sample-cluster similarities via a balancing operator. The method operates in a teacher–student framework, updating the balancing operator online without requiring large batch sizes, and yields stable training with minimal computational overhead. Empirical results on ImageNet with ResNet50 and ViT backbones show state-of-the-art linear and semi-supervised performance, and competitive results on object detection and segmentation while using substantially fewer training epochs and smaller batches. This approach significantly lowers resource barriers for pretraining and demonstrates robust, scalable unsupervised visual representations.

Abstract

Self-supervised learning has recently emerged as the preeminent pretraining paradigm across and between modalities, with remarkable results. In the image domain specifically, group (or cluster) discrimination has been one of the most successful methods. However, such frameworks need to guard against heavily imbalanced cluster assignments to prevent collapse to trivial solutions. Existing works typically solve this by reweighing cluster assignments to promote balance, or with offline operations (e.g. regular re-clustering) that prevent collapse. However, the former typically requires large batch sizes, which leads to increased resource requirements, and the latter introduces scalability issues with regard to large datasets. In this work, we propose ExCB, a framework that tackles this problem with a novel cluster balancing method. ExCB estimates the relative size of the clusters across batches and balances them by adjusting cluster assignments, proportionately to their relative size and in an online manner. Thereby, it overcomes previous methods' dependence on large batch sizes and is fully online, and therefore scalable to any dataset. We conduct extensive experiments to evaluate our approach and demonstrate that ExCB: a) achieves state-of-the-art results with significantly reduced resource requirements compared to previous works, b) is fully online, and therefore scalable to large datasets, and c) is stable and effective even with very small batch sizes.
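The abstract's idea of estimating relative cluster sizes across batches, online, can be illustrated with a minimal sketch. It assumes an exponential moving average over per-batch hard-assignment counts; the function names, the `momentum` parameter, and the EMA update rule are illustrative assumptions and are not taken from the paper.

```python
from typing import List, Sequence


def hard_assign(similarities: Sequence[Sequence[float]]) -> List[int]:
    """Return, for each sample, the index of its most similar cluster."""
    return [max(range(len(row)), key=lambda j: row[j]) for row in similarities]


def update_relative_sizes(
    s: Sequence[float], assignments: Sequence[int], momentum: float = 0.9
) -> List[float]:
    """Update a running estimate of relative cluster sizes (assumed EMA rule).

    A relative size of 1 means the cluster holds exactly its fair share
    (1/K) of the samples; values above 1 mean oversized, below 1 undersized.
    """
    k = len(s)
    n = len(assignments)
    counts = [0] * k
    for a in assignments:
        counts[a] += 1
    # Relative size observed in this batch: count * K / batch_size.
    batch_rel = [c * k / n for c in counts]
    return [momentum * s_k + (1.0 - momentum) * b_k for s_k, b_k in zip(s, batch_rel)]


# Two clusters, four samples; three samples fall into cluster 0.
s = [1.0, 1.0]
batch = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.1, 0.6]]
s = update_relative_sizes(s, hard_assign(batch), momentum=0.5)
print(s)  # [1.25, 0.75]
```

Because the estimate is accumulated over many batches, a single small batch does not need to contain a representative sample of every cluster, which is consistent with the abstract's claim of independence from large batch sizes.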

Paper Structure (29 sections, 12 equations, 7 figures, 9 tables, 2 algorithms)

Introduction
Related Works
Self-supervised learning for visual data
Clustering-based self-supervised learning
Method
Overview
Online Cluster Balancing
Measuring relative cluster sizes.
Adjusting sample-cluster similarities.
Summary.
Experiments
Implementation details
Architecture & Hyperparameters.
Learning cluster centroids.
Results
...and 14 more sections

Figures (7)

Figure 1: Illustration of ExCB's balancing operator $\mathcal{B}$ for two clusters $c_1$ (red) and $c_2$ (blue). $\mathcal{B}(z;s)$ adjusts sample-cluster cosine similarities $z$ according to the relative cluster sizes, as measured in $s$. For smaller clusters the similarities are increased ($z^B>z$), whereas for larger clusters the similarities are decreased ($z^B<z$). As the figure shows, the boundary between clusters shifts, undersized (oversized) clusters are assigned more (fewer) samples, and clusters become more balanced.
Figure 2: Linear classification accuracy on ImageNet with ResNet50 for different self-supervised methods. Circles indicate pretraining batch size. ExCB achieves state-of-the-art results with the most efficient combination of few epochs and small batch size.
Figure 3: Overview of ExCB. The student is trained so that ${\bm{p}}_s(x'')$ matches ${\bm{p}}_t(x')$, where $x'$ and $x''$ are transformed views of $x$. The balancing module $\mathcal{B}$ adjusts cluster assignments to promote a uniform distribution of samples over the clusters across the dataset.
Figure 4: ExCB training statistics.
Figure 5: Sample distribution over the clusters for ExCB and DINO. Clusters are sorted according to their relative size, defined as $\frac{N_c K}{N}$, where $N_c$ is the number of samples assigned to that cluster, $N=1{,}281{,}167$ is the total number of samples, and $K=65{,}536$ is the number of clusters. In each plot, we highlight the optimal relative cluster size of 1 (for $N_c=\frac{N}{K}$) and the empty clusters ($N_c=0$).
...and 2 more figures
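
The behaviour described in the Figure 1 and Figure 5 captions can be made concrete with a short sketch. The `balance` function below is a hypothetical stand-in for $\mathcal{B}(z;s)$: it only reproduces the properties stated in the caption (similarities rise for undersized clusters, fall for oversized ones, and stay within the cosine range $[-1, 1]$) and should not be read as the paper's exact definition. `relative_size` implements the $\frac{N_c K}{N}$ quantity from Figure 5.

```python
from typing import List, Sequence


def balance(z: Sequence[float], s: Sequence[float]) -> List[float]:
    """Hypothetical balancing operator (assumed form, not the paper's).

    z: cosine similarities of one sample to each cluster, in [-1, 1].
    s: relative cluster sizes (1 = perfectly balanced).
    """
    out = []
    for z_k, s_k in zip(z, s):
        if s_k <= 1.0:
            # Undersized cluster: shrink the gap to 1, so z^B >= z.
            out.append(1.0 - (1.0 - z_k) * s_k)
        else:
            # Oversized cluster: shrink the gap to -1, so z^B < z.
            out.append((1.0 + z_k) / s_k - 1.0)
    return out


def relative_size(n_c: float, n: int, k: int) -> float:
    """Relative cluster size N_c * K / N, as defined in the Figure 5 caption."""
    return n_c * k / n


# A sample slightly closer to the oversized cluster 0 than to cluster 1.
z = [0.50, 0.45]
s = [1.5, 0.5]
z_b = balance(z, s)

# Before balancing the sample goes to cluster 0, afterwards to cluster 1.
before = max(range(len(z)), key=lambda j: z[j])
after = max(range(len(z_b)), key=lambda j: z_b[j])
print(before, after)  # 0 1
```

With the values from Figure 5, a perfectly balanced partition would place $N/K = 1{,}281{,}167 / 65{,}536 \approx 19.5$ images in each cluster, which corresponds to a relative size of 1.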
