Image Clustering via the Principle of Rate Reduction in the Age of Pretrained Models

Tianzhe Chu, Shengbang Tong, Tianjiao Ding, Xili Dai, Benjamin David Haeffele, René Vidal, Yi Ma

TL;DR

This paper introduces CPP (Clustering via the Principle of rate Reduction and Pretrained models), a scalable image clustering pipeline that fuses CLIP-based representations with the Maximal Coding Rate Reduction (MCR^2) objective to learn a structured embedding and a doubly stochastic clustering matrix. It includes a model-selection mechanism that estimates the optimal number of clusters via a coding-length criterion without any retraining, and a simple self-labeling step that generates meaningful cluster captions by exploiting CLIP's text–image alignment. Empirical results show state-of-the-art clustering performance on CIFAR-10/20/100 and ImageNet-1k, with demonstrated effectiveness on large uncurated datasets like MS-COCO and LAION-Aesthetics, and a WikiArt case study illustrating applicability to art domains. The approach yields more structured representations than CLIP alone, improves image-to-image search, and provides semantically interpretable cluster labels, making it practical for large-scale, real-world clustering tasks.
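
To make the rate reduction idea concrete, the following is a minimal NumPy sketch of the standard MCR^2 quantities: the coding rate $R({\mathcal{Z}}) = \frac{1}{2}\log\det(I + \frac{d}{n\epsilon^2}{\mathcal{Z}}{\mathcal{Z}}^\top)$ and the rate reduction $\Delta R$ obtained by subtracting the membership-weighted rates of the individual clusters. It is an illustration under stated assumptions, not the CPP implementation: it uses a plain $n \times k$ membership matrix rather than CPP's doubly stochastic matrix, and the function names, the toy data, and the value `eps=0.5` are hypothetical choices.

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """R(Z) = 1/2 * logdet(I + d / (n * eps^2) * Z Z^T), with Z of shape (d, n)."""
    d, n = Z.shape
    _, logdet = np.linalg.slogdet(np.eye(d) + (d / (n * eps**2)) * Z @ Z.T)
    return 0.5 * logdet

def rate_reduction(Z, Pi, eps=0.5):
    """Delta R = R(Z) - sum_j n_j / (2n) * logdet(I + d / (n_j * eps^2) * Z diag(Pi_j) Z^T).

    Pi has shape (n, k); column j holds the (soft) membership of each sample in cluster j.
    """
    d, n = Z.shape
    compressed = 0.0
    for j in range(Pi.shape[1]):
        w = Pi[:, j]
        n_j = w.sum()
        if n_j < 1e-8:          # skip empty clusters
            continue
        cov = (Z * w) @ Z.T     # equals Z diag(w) Z^T
        _, logdet = np.linalg.slogdet(np.eye(d) + (d / (n_j * eps**2)) * cov)
        compressed += (n_j / (2 * n)) * logdet
    return coding_rate(Z, eps) - compressed

# Toy data (assumed): 100 unit-norm features in d=10, two orthogonal 1-D clusters.
Z = np.zeros((10, 100))
Z[0, :50] = 1.0
Z[1, 50:] = 1.0
Pi_true = np.zeros((100, 2))
Pi_true[:50, 0] = 1.0
Pi_true[50:, 1] = 1.0
Pi_single = np.ones((100, 1))   # everything in one cluster

dr_true = rate_reduction(Z, Pi_true)
dr_single = rate_reduction(Z, Pi_single)
```

On this toy input the correct partition gives a strictly positive rate reduction, while lumping all samples into one cluster gives zero, which is the behavior the objective is meant to reward.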

Abstract

The advent of large pre-trained models has brought about a paradigm shift in both visual representation learning and natural language processing. However, clustering unlabeled images, a fundamental and classic machine learning problem, still lacks an effective solution, particularly for large-scale datasets. In this paper, we propose a novel image clustering pipeline that leverages the powerful feature representations of large pre-trained models such as CLIP and clusters images effectively and efficiently at scale. We first develop a novel algorithm to estimate the number of clusters in a given dataset. We then show that the pre-trained features become significantly more structured after further optimizing the rate reduction objective. The resulting features can significantly improve clustering accuracy, e.g., from 57% to 66% on ImageNet-1k. Furthermore, by leveraging CLIP's multimodal bridge between image and text, we develop a simple yet effective self-labeling algorithm that produces meaningful captions for the clusters. Through extensive experiments, we show that our pipeline works well on standard datasets such as CIFAR-10, CIFAR-100, and ImageNet-1k. It also extends to datasets that are not curated for clustering, such as LAION-Aesthetics and WikiArts. The code is released at https://github.com/LeslieTrue/CPP.
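
The self-labeling step can be sketched as a per-cluster vote over caption candidates. The snippet below is a hedged illustration, not the released code: it assumes image and text embeddings have already been computed (for example by CLIP's two encoders), and the function name `vote_cluster_captions`, the toy embeddings, and the candidate captions are all hypothetical.

```python
import numpy as np

def vote_cluster_captions(image_emb, text_emb, labels, captions):
    """Each image votes for its most cosine-similar caption; the majority wins per cluster.

    image_emb: (n_images, dim), text_emb: (n_captions, dim), labels: (n_images,)
    """
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    best = (img @ txt.T).argmax(axis=1)   # top caption index for every image
    result = {}
    for c in np.unique(labels):
        votes = np.bincount(best[labels == c], minlength=len(captions))
        result[int(c)] = captions[int(votes.argmax())]
    return result

# Toy stand-ins for encoder outputs (assumed values, not real CLIP features).
captions = ["dog", "cat", "car"]
text_emb = np.eye(3)
image_emb = np.array([
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 1.0, 0.1],
    [0.1, 0.8, 0.0],
    [0.0, 0.1, 1.0],
])
labels = np.array([0, 0, 1, 1, 2])

cluster_captions = vote_cluster_captions(image_emb, text_emb, labels, captions)
```

Voting over all images in a cluster, rather than captioning a single representative image, makes the label less sensitive to individual outliers.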

Paper Structure (23 sections, 7 equations, 14 figures, 9 tables, 2 algorithms)

Figures (14)

  • Figure 1: Overall pipeline of CPP. Left: In the training stage, CPP initializes the features ${\mathcal{Z}}$ and cluster membership $\boldsymbol{\Pi}$ from a large pre-trained model, and updates ${\mathcal{Z}}$ and $\boldsymbol{\Pi}$ by optimizing the \ref{eq:mcr2-clustering} objective. Middle: Once training is done, CPP selects the optimal number of clusters via the coding length $L(\cdot)$ criterion. Right: CPP assigns a semantic caption to each cluster by computing cosine similarities between text candidates and images and voting for the most suitable caption.
  • Figure 2: Structured representations learned by CPP. Left: An example of image-to-image search on ImageNet, using representations provided by CLIP (Top) and CPP (Bottom). Right: Cosine similarity $|{\mathcal{Z}}^\top{\mathcal{Z}}|$ visualization. Clear block-diagonal structures emerge in CPP-learned representations (Bottom), while the ones learned by CLIP show strong sample-wise correlation (Top).
  • Figure 3: Normalized singular values of CIFAR-100 features. Left: full dataset features. Middle: cluster-wise features from CLIP, membership given by KMeans. Right: cluster-wise features from CPP, membership given by spectral clustering on the membership matrix.
  • Figure 4: Model selection for clustering without knowing the number of clusters using \ref{algo:unknown-n-cluster}. For each dataset, the elbow point of the curve indicates the optimal number of clusters (in parentheses). The model selection is done efficiently without any retraining. See \ref{appendix: optimal_clusters_moreresults} for more results.
  • Figure 5: Examples of cluster captioning on MS-COCO (Left) and LAION-Aesthetics (Right). More visualizations can be found in \ref{appendix: cluster&label}.
  • ...and 9 more figures
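
The diagnostics described in the captions of Figures 2 and 3 can be approximated with a few lines of NumPy. This is a rough sketch under assumptions, not the plotting code behind the figures: the helper names `sorted_similarity` and `normalized_singular_values` and the toy features are hypothetical, and features are assumed to be stored as columns of a $d \times n$ matrix.

```python
import numpy as np

def sorted_similarity(Z, labels):
    """|Z^T Z| on unit-normalized columns, reordered so same-cluster samples are adjacent."""
    Zn = Z / np.linalg.norm(Z, axis=0, keepdims=True)
    order = np.argsort(labels, kind="stable")
    Zs = Zn[:, order]
    return np.abs(Zs.T @ Zs)

def normalized_singular_values(Z):
    """Singular values of Z divided by the largest one."""
    s = np.linalg.svd(Z, compute_uv=False)
    return s / s[0]

# Toy features (assumed): 6 samples in d=4, alternating between two orthogonal directions.
Z = np.zeros((4, 6))
Z[0, [0, 2, 4]] = 1.0
Z[1, [1, 3, 5]] = 1.0
labels = np.array([0, 1, 0, 1, 0, 1])

S = sorted_similarity(Z, labels)       # block-diagonal after sorting by cluster
sv = normalized_singular_values(Z)     # two dominant directions, the rest vanish
```

A block-diagonal `S` indicates that samples are similar within clusters and nearly orthogonal across them, and a sharp drop in `sv` indicates that the features occupy a low-dimensional subspace.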