Text-Guided Image Clustering
Andreas Stephan, Lukas Miklautz, Kevin Sidak, Jan Philip Wahle, Bela Gipp, Claudia Plant, Benjamin Roth
TL;DR
The paper tackles the challenge of image clustering by shifting from pixel-based representations to text-derived representations. It proposes a text-guided clustering framework that generates captions and VQA-based text, then clusters the resulting text using TF-IDF or SBERT embeddings; it further augments this with keyword- and prompt-guided knowledge injection and introduces a counting-based cluster explainability mechanism. Across eight diverse image datasets, text-based representations often outperform image features, and knowledge-infused prompts yield additional gains, highlighting the potential of text as an interpretable clustering abstraction. This work suggests a paradigm shift in image clustering, enabling domain-specific guidance and human-readable explanations for clusters, with implications for retrieval, analysis, and exploratory data science.
Abstract
Image clustering divides a collection of images into meaningful groups, typically interpreted post-hoc via human-given annotations. These annotations usually take the form of text, which raises the question of whether text can serve as an abstraction for image clustering. Current image clustering methods, however, neglect the use of generated textual descriptions. We therefore propose Text-Guided Image Clustering, i.e., generating text using image captioning and visual question-answering (VQA) models and subsequently clustering the generated text. Further, we introduce a novel approach to inject task or domain knowledge for clustering by prompting VQA models. Across eight diverse image clustering datasets, our results show that the obtained text representations often outperform image features. Additionally, we propose a counting-based cluster explainability method. Our evaluations show that the derived keyword-based explanations describe clusters better than the respective cluster accuracy suggests. Overall, this research challenges traditional approaches and paves the way for a paradigm shift in image clustering, using generated text.
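To make the idea of a counting-based, keyword-style cluster explanation concrete, the following is a minimal sketch and not the authors' implementation. It assumes that captions have already been generated and that cluster assignments are already available; the function name `top_keywords`, the tiny stopword list, the toy captions, and the choice to count each word once per caption are all illustrative assumptions.

```python
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"a", "an", "the", "of", "on", "in", "and", "is", "with"}


def top_keywords(texts, labels, k=2):
    """Return the k most frequent non-stopword tokens for each cluster label."""
    counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        tokens = {t for t in text.lower().split() if t not in STOPWORDS}
        # Count each token once per text so that long captions do not dominate.
        counts[label].update(tokens)
    result = {}
    for label, counter in counts.items():
        # Sort by descending count, then alphabetically, so ties are deterministic.
        ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
        result[label] = [word for word, _ in ranked[:k]]
    return result


# Toy captions standing in for generated image descriptions.
captions = [
    "a dog running on the grass",
    "a brown dog with a ball",
    "a dog in the park",
    "a red car on the road",
    "a car parked in the street",
    "a blue car on the road",
]
# Cluster assignments, assumed to come from clustering the caption text.
labels = [0, 0, 0, 1, 1, 1]

keywords = top_keywords(captions, labels, k=2)
print(keywords)
```

In this toy setting the most frequent word in each cluster ("dog", "car") acts as a human-readable description of that cluster. A realistic pipeline would need proper tokenization and a fuller stopword list.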
