Keep It Light! Simplifying Image Clustering Via Text-Free Adapters
Yicen Li, Haitz Sáez de Ocáriz Borde, Anastasis Kratsios, Paul D. McNicholas
TL;DR
This work questions the necessity of text-based components in deep image clustering and introduces SCP (Simple Clustering via Pre-trained models), a text-free adapter that leverages pre-trained vision encoders. It provides a theoretical foundation, the Lossless Amortization Principle, according to which a text-free representation can approximate text-dependent classifiers under ideal conditions. The method keeps the backbone frozen, adds a trainable clustering head, and trains that head with three terms: cross-view consistency, a confidence term, and entropy regularization. Empirically, SCP variants using CLIP or DINO features achieve competitive or state-of-the-art clustering on CIFAR, STL-10, ImageNet subsets, and more challenging datasets, with strong text-free performance and broad applicability. This approach offers a practical, scalable alternative for real-world clustering when text data or multimodal models are not available.
Abstract
In the era of pre-trained models, effective classification can often be achieved using simple linear probing or lightweight readout layers. In contrast, many competitive clustering pipelines have a multi-modal design, leveraging large language models (LLMs) or other text encoders together with text-image pairs, which are often unavailable in real-world downstream applications. Additionally, such frameworks are generally complicated to train and require substantial computational resources, making widespread adoption challenging. In this work, we show that, in deep clustering, performance competitive with more complex state-of-the-art methods can be achieved using a text-free and highly simplified training pipeline. In particular, our approach, Simple Clustering via Pre-trained models (SCP), trains only a small cluster head while leveraging pre-trained vision model feature representations and positive data pairs. Experiments on benchmark datasets, including CIFAR-10, CIFAR-20, CIFAR-100, STL-10, ImageNet-10, and ImageNet-Dogs, demonstrate that SCP achieves highly competitive performance. Furthermore, we provide a theoretical result explaining why, at least under ideal conditions, additional text-based embeddings may not be necessary to achieve strong clustering performance in vision.
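To make the training signal concrete, here is a minimal sketch of how a cluster head could be scored on a positive pair, using the three terms named in the TL;DR (cross-view consistency, confidence, entropy regularization). The exact form of each term, the weights `alpha` and `beta`, and the function name `scp_style_loss` are assumptions made for illustration. They are not taken from the paper. The logits stand in for the output of a small head applied to frozen encoder features.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def scp_style_loss(logits_a, logits_b, alpha=1.0, beta=1.0, eps=1e-8):
    """Hypothetical three-term objective for a cluster head.

    logits_a, logits_b: head outputs for two views of the same N images,
    each a list of N rows with K cluster logits per row.
    alpha, beta: illustrative weights (assumptions, not from the paper).
    """
    probs_a = [softmax(row) for row in logits_a]
    probs_b = [softmax(row) for row in logits_b]
    n, k = len(probs_a), len(probs_a[0])
    both = probs_a + probs_b

    # Cross-view consistency: two views of one image should share a cluster.
    consistency = -sum(
        math.log(sum(pa * pb for pa, pb in zip(ra, rb)) + eps)
        for ra, rb in zip(probs_a, probs_b)
    ) / n

    # Confidence: reward peaked (near one-hot) assignments.
    confidence = -sum(math.log(max(row) + eps) for row in both) / (2 * n)

    # Entropy regularization: negative entropy of the mean assignment,
    # so minimizing it discourages collapse into a single cluster.
    mean = [sum(row[j] for row in both) / (2 * n) for j in range(k)]
    neg_entropy = sum(p * math.log(p + eps) for p in mean)

    return consistency + alpha * confidence + beta * neg_entropy

# Views agree and both clusters are used.
balanced = [[8.0, -8.0], [-8.0, 8.0]]
good = scp_style_loss(balanced, balanced)

# Every sample falls into cluster 0: the entropy term penalizes this.
one_cluster = [[8.0, -8.0], [8.0, -8.0]]
collapsed = scp_style_loss(one_cluster, one_cluster)

# The two views disagree: the consistency term penalizes this.
flipped = [[-8.0, 8.0], [8.0, -8.0]]
disagree = scp_style_loss(balanced, flipped)
```

In this toy setup, the balanced and consistent assignment scores lower than both the collapsed and the disagreeing one. That ordering is the behaviour such an objective is meant to encourage. In practice only the head's parameters would receive gradients, with the encoder left frozen.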
