Image Clustering via the Principle of Rate Reduction in the Age of Pretrained Models

Tianzhe Chu; Shengbang Tong; Tianjiao Ding; Xili Dai; Benjamin David Haeffele; René Vidal; Yi Ma

TL;DR

This paper introduces CPP (Clustering via the Principle of rate Reduction and Pretrained models), a scalable image clustering pipeline that combines CLIP-based representations with the Maximal Coding Rate Reduction (MCR^2) objective to learn a structured embedding and a doubly stochastic cluster membership matrix. It includes a model-selection mechanism that estimates the optimal number of clusters with a coding-length criterion and requires no retraining, and a simple self-labeling step that generates meaningful cluster captions by exploiting CLIP's text–image alignment. Empirical results show state-of-the-art clustering performance on CIFAR-10/20/100 and ImageNet-1k, with demonstrated effectiveness on large uncurated datasets such as MS-COCO and LAION-Aesthetics, and a WikiArt case study illustrating applicability to art. The approach yields more structured representations than CLIP alone, improves image-to-image search, and provides semantically interpretable cluster labels, making it practical for large-scale, real-world clustering tasks.
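A doubly stochastic membership matrix has nonnegative entries whose rows and columns each sum to one. The summary above does not say how CPP enforces this, so the following is only a minimal sketch of one common technique, Sinkhorn-Knopp normalization. The function name `sinkhorn`, the temperature, the iteration count, and the random stand-in features are all assumptions made for illustration, not details taken from the paper or its code.

```python
import numpy as np

def sinkhorn(scores, n_iters=200, temperature=0.1):
    """Hypothetical helper: scale exp(scores / temperature) until rows and columns sum to 1."""
    # Subtract the maximum before exponentiating for numerical stability.
    K = np.exp((scores - scores.max()) / temperature)
    for _ in range(n_iters):
        K = K / K.sum(axis=1, keepdims=True)  # make every row sum to 1
        K = K / K.sum(axis=0, keepdims=True)  # make every column sum to 1
    return K

# Random unit-norm vectors stand in for pretrained image features (assumption).
rng = np.random.default_rng(0)
Z = rng.normal(size=(8, 16))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)

# Pairwise cosine similarities are turned into a soft membership matrix.
Pi = sinkhorn(Z @ Z.T)
print(Pi.sum(axis=0))
print(Pi.sum(axis=1))
```

Entry `Pi[i, j]` can be read as a soft indication that samples `i` and `j` belong together; hard cluster labels would still require a further step such as spectral clustering on `Pi`.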

Abstract

The advent of large pre-trained models has brought about a paradigm shift in both visual representation learning and natural language processing. However, clustering unlabeled images, a fundamental and classic machine learning problem, still lacks an effective solution, particularly for large-scale datasets. In this paper, we propose a novel image clustering pipeline that leverages the powerful feature representations of large pre-trained models such as CLIP and clusters images effectively and efficiently at scale. We first develop a novel algorithm to estimate the number of clusters in a given dataset. We then show that the pre-trained features become significantly more structured when further optimized with the rate reduction objective. The resulting features can significantly improve clustering accuracy, e.g., from 57% to 66% on ImageNet-1k. Furthermore, by leveraging CLIP's multimodal bridge between image and text, we develop a simple yet effective self-labeling algorithm that produces meaningful captions for the clusters. Through extensive experiments, we show that our pipeline works well on standard datasets such as CIFAR-10, CIFAR-100, and ImageNet-1k. It also extends to datasets that are not curated for clustering, such as LAION-Aesthetics and WikiArt. The code is available at https://github.com/LeslieTrue/CPP.
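To make the rate reduction objective concrete, the sketch below computes the standard coding rate R(Z) = 1/2 logdet(I + d/(n eps^2) Z^T Z) and the resulting rate reduction for a hard cluster assignment. It is a simplified illustration, not the paper's training code: the pipeline described here works with a soft membership matrix, whereas this example uses hard labels, stores samples as rows, and picks `eps=0.5` and the toy two-subspace data purely as assumptions.

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """Coding rate of the rows of Z (n samples, d dimensions)."""
    n, d = Z.shape
    _, logdet = np.linalg.slogdet(np.eye(d) + (d / (n * eps**2)) * Z.T @ Z)
    return 0.5 * logdet

def rate_reduction(Z, labels, eps=0.5):
    """Rate of the whole set minus the size-weighted rates of each cluster."""
    n = Z.shape[0]
    compressed = 0.0
    for k in np.unique(labels):
        Zk = Z[labels == k]
        compressed += (len(Zk) / n) * coding_rate(Zk, eps)
    return coding_rate(Z, eps) - compressed

# Toy data (assumption): two clusters lying on orthogonal 2-D subspaces of R^4.
rng = np.random.default_rng(0)
A = np.hstack([rng.normal(size=(100, 2)), np.zeros((100, 2))])
B = np.hstack([np.zeros((100, 2)), rng.normal(size=(100, 2))])
Z = np.vstack([A, B])
Z /= np.linalg.norm(Z, axis=1, keepdims=True)

true_labels = np.repeat([0, 1], 100)
shuffled_labels = rng.permutation(true_labels)

rr_true = rate_reduction(Z, true_labels)
rr_shuffled = rate_reduction(Z, shuffled_labels)
print(rr_true, rr_shuffled)
```

The correct partition yields a much larger rate reduction than a shuffled one, which is the property that makes the quantity usable as a clustering objective.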

Paper Structure (23 sections, 7 equations, 14 figures, 9 tables, 2 algorithms)

Motivation
Related Work
Our Method
Review of Manifold Linearizing and Clustering
Training MLC: Leveraging and Refining CLIP Features
Determining Number of Clusters without Retraining
Cluster Captioning and Image-to-Image Search
Experiments
Comparison with Deep Clustering Methods
Comparison of Features Learned by CLIP and CPP
Clustering and Captioning on Large Uncurated Image Datasets
WikiArt: A case study of image clustering in the age of pre-training
Conclusion and Discussion
Additional Quantitative Results and Clarifications
Comparison of Different Pre-trained Visual Models
...and 8 more sections

Figures (14)

Figure 1: Overall pipeline of CPP. Left: In the training stage, CPP initializes the features ${\mathcal{Z}}$ and cluster membership $\boldsymbol{\Pi}$ from a large pre-trained model, and updates ${\mathcal{Z}}$ and $\boldsymbol{\Pi}$ by optimizing the MCR$^2$ clustering objective. Middle: Once training is done, CPP selects the optimal number of clusters via the coding length $L(\cdot)$ criterion. Right: CPP assigns a semantic caption to each cluster by computing cosine similarities between text candidates and images and voting for the most suitable caption.
Figure 2: Structured representations learned by CPP. Left: An example of image-to-image search on ImageNet, using representations provided by CLIP (Top) and CPP (Bottom). Right: Cosine similarity $|{\mathcal{Z}}^\top{\mathcal{Z}}|$ visualization. Clear block-diagonal structures emerge in CPP-learned representations (Bottom), while the ones learned by CLIP show strong sample-wise correlation (Top).
Figure 3: Normalized singular values of CIFAR-100 features. Left: full dataset features. Middle: cluster-wise features from CLIP, with membership given by KMeans. Right: cluster-wise features from CPP, with membership given by spectral clustering on the membership matrix.
Figure 4: Model selection for clustering without knowing the number of clusters, using the paper's algorithm for an unknown number of clusters. For each dataset, the elbow point of the curve indicates the optimal number of clusters (in parentheses). The model selection is done efficiently without any retraining. See the appendix for more results.
Figure 5: Examples of cluster captioning on MS-COCO (Left) and LAION-Aesthetics (Right). More visualizations can be found in the appendix.
...and 9 more figures
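The caption-voting step described in the Figure 1 caption can be sketched as follows: each image votes for the text candidate with the highest cosine similarity, and each cluster takes the candidate with the most votes. This is a hedged illustration only. The function name `vote_captions`, the candidate captions, and the random vectors standing in for CLIP image and text embeddings are all hypothetical, and the paper's actual procedure may differ in its details.

```python
import numpy as np
from collections import Counter

def vote_captions(image_feats, text_feats, candidates, cluster_ids):
    """Hypothetical helper: pick, per cluster, the caption most images vote for."""
    img = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    sims = img @ txt.T            # cosine similarity, shape (n_images, n_captions)
    votes = sims.argmax(axis=1)   # each image votes for its closest caption
    captions = {}
    for k in np.unique(cluster_ids):
        winner, _ = Counter(votes[cluster_ids == k].tolist()).most_common(1)[0]
        captions[int(k)] = candidates[winner]
    return captions

# Toy stand-ins for embeddings (assumption): images are noisy copies of a text vector.
rng = np.random.default_rng(0)
candidates = ["dog", "car", "tree"]
text_feats = rng.normal(size=(3, 32))
text_feats /= np.linalg.norm(text_feats, axis=1, keepdims=True)

cluster_ids = np.repeat([0, 1, 2], 20)
image_feats = text_feats[cluster_ids] + 0.1 * rng.normal(size=(60, 32))

captions = vote_captions(image_feats, text_feats, candidates, cluster_ids)
print(captions)
```

Majority voting keeps a few poorly matched images from changing a cluster's caption, at the cost of ignoring how confident each individual vote is.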
