Keep It Light! Simplifying Image Clustering Via Text-Free Adapters

Yicen Li, Haitz Sáez de Ocáriz Borde, Anastasis Kratsios, Paul D. McNicholas

TL;DR

This work questions the necessity of text-based components in deep image clustering and introduces SCP, a text-free adapter that leverages pre-trained vision encoders. It provides a theoretical foundation—the Lossless Amortization Principle—that a text-free representation can approximate text-dependent classifiers under ideal conditions. The method uses a frozen backbone with a trainable clustering head and trains with cross-view consistency, a confidence term, and entropy regularization. Empirically, SCP variants using CLIP or DINO features achieve competitive or state-of-the-art clustering on CIFAR, STL-10, ImageNet subsets, and challenging datasets, with strong text-free performance and broad applicability. This approach offers a practical, scalable alternative for real-world clustering when text data or multimodal models are not available.

Abstract

In the era of pre-trained models, effective classification can often be achieved using simple linear probing or lightweight readout layers. In contrast, many competitive clustering pipelines have a multi-modal design, leveraging large language models (LLMs) or other text encoders, and text-image pairs, which are often unavailable in real-world downstream applications. Additionally, such frameworks are generally complicated to train and require substantial computational resources, making widespread adoption challenging. In this work, we show that in deep clustering, competitive performance with more complex state-of-the-art methods can be achieved using a text-free and highly simplified training pipeline. In particular, our approach, Simple Clustering via Pre-trained models (SCP), trains only a small cluster head while leveraging pre-trained vision model feature representations and positive data pairs. Experiments on benchmark datasets, including CIFAR-10, CIFAR-20, CIFAR-100, STL-10, ImageNet-10, and ImageNet-Dogs, demonstrate that SCP achieves highly competitive performance. Furthermore, we provide a theoretical result explaining why, at least under ideal conditions, additional text-based embeddings may not be necessary to achieve strong clustering performance in vision.

Paper Structure

This paper contains 27 sections, 2 theorems, 16 equations, 5 figures, 7 tables, 1 algorithm.

Key Result

Proposition 1

Fix $d,D,C\in \mathbb{N}$. Let $X,Z$ be random variables taking values in $\mathbb{R}^d$ and in $\mathbb{R}^D$, respectively, both defined on a common probability space $(\Omega,\mathcal{F},\mathbb{P})$, and suppose that $Z$ is $\sigma(X)$-measurable. For every $\{0,1,\dots,C-1\}$-value then, there is a Borel map $F: (\mathbb{R}^d,\mathcal{B}(\mathbb{R}^d))\to (\mathbb{R}^n,\mathcal{B

Figures (5)

  • Figure 1: The overall pipeline for SCP. During training, two augmented views $T^a$ and $T^b$ of an image are generated from the dataset and processed by a frozen feature extractor $f$ and a trainable cluster head $g$ (a five-layer MLP). The objective is to minimize the cross-entropy loss between the outputs of the cluster head $g$ for the two augmented views.
  • Figure 2: Visualization of clustering performance for SCP-CLIP with a ViT-B/32 backbone. (Left): An example of an image-to-image search on STL-10, showing clusters produced by CLIP (Top) and SCP (Bottom); (Right): Visualization of clustering performance. SCP-CLIP improves on the clustering performance of the original CLIP.
  • Figure 3: Visualization of representations learned by different methods on the CIFAR-10 training set, along with the corresponding clustering accuracy (ACC). (a) DINO + K-means. (b) CLIP + K-means. (c) SCP-DINO logits. (d) SCP-CLIP logits.
  • Figure 4: Visualization of learned representations at different training steps on ImageNet-Dogs. Each panel shows the t-SNE embeddings of the DINO-based features, where (a) depicts the raw encoder outputs clustered by K-means, and (b) to (d) show the logits learned by SCP-DINO across successive training stages (77 steps per epoch).
  • Figure 5: Comparison of different loss weights $\alpha$. The solid line shows the mean ARI across five runs, and the shaded region indicates the standard deviation.
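
The training objective described in Figure 1 and the TL;DR can be illustrated with a minimal sketch. It assumes the cluster head $g$ has already produced logits for the two augmented views of each image (the frozen extractor $f$ and the augmentations are not shown). The exact loss forms are not given in this summary, so the code below is an assumption: cross-entropy between the two views' softmax outputs for consistency, minus the entropy of the batch-level cluster usage as a regularizer against collapse. The confidence term is omitted because its form is not stated here. The names `cross_view_loss` and `reg_weight` are hypothetical, and `reg_weight` is not claimed to be the paper's $\alpha$.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_view_loss(logits_a, logits_b, reg_weight=1.0, eps=1e-12):
    """Assumed loss sketch: cross-view consistency minus marginal entropy.

    logits_a, logits_b: n x C nested lists of cluster-head outputs for
    view a and view b of the same n images (hypothetical inputs).
    """
    probs_a = [softmax(row) for row in logits_a]
    probs_b = [softmax(row) for row in logits_b]
    n = len(probs_a)
    num_clusters = len(probs_a[0])

    # Consistency: cross-entropy between the two views, averaged over images.
    consistency = -sum(
        pa * math.log(pb + eps)
        for row_a, row_b in zip(probs_a, probs_b)
        for pa, pb in zip(row_a, row_b)
    ) / n

    # Average cluster usage over both views of the whole batch.
    marginal = [
        sum(row[c] for row in probs_a + probs_b) / (2 * n)
        for c in range(num_clusters)
    ]
    marginal_entropy = -sum(p * math.log(p + eps) for p in marginal)

    # Subtracting the entropy rewards balanced cluster usage, which
    # discourages assigning every image to one cluster.
    return consistency - reg_weight * marginal_entropy


# Two images, two clusters: views agree and clusters are balanced.
agree = cross_view_loss([[10, 0], [0, 10]], [[10, 0], [0, 10]])
# Same images, but the two views disagree on the assignment.
disagree = cross_view_loss([[10, 0], [0, 10]], [[0, 10], [10, 0]])
# Views agree, but every image falls into the same cluster.
collapsed = cross_view_loss([[10, 0], [10, 0]], [[10, 0], [10, 0]])
```

Under these assumptions, `agree` is the smallest of the three values: disagreement between views raises the consistency term, and collapse onto a single cluster removes the entropy bonus.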

Theorems & Definitions (4)

  • Proposition 1: Lossless Amortization Principle (LAR)
  • Proof of Proposition 1
  • Theorem 1: Text-Free DC is Powerful Enough
  • Proof of Theorem 1