Text-Guided Image Clustering

Andreas Stephan, Lukas Miklautz, Kevin Sidak, Jan Philip Wahle, Bela Gipp, Claudia Plant, Benjamin Roth

TL;DR

The paper tackles the challenge of image clustering by shifting from pixel-based representations to text-derived representations. It proposes a text-guided clustering framework that generates captions and VQA-based text, then clusters the resulting text using TF-IDF or SBERT embeddings; it further augments this with keyword- and prompt-guided knowledge injection and introduces a counting-based cluster explainability mechanism. Across eight diverse image datasets, text-based representations often outperform image features, and knowledge-infused prompts yield additional gains, highlighting the potential of text as an interpretable clustering abstraction. This work suggests a paradigm shift in image clustering, enabling domain-specific guidance and human-readable explanations for clusters, with implications for retrieval, analysis, and exploratory data science.

Abstract

Image clustering divides a collection of images into meaningful groups, typically interpreted post-hoc via human-given annotations. These annotations are usually text, which raises the question of whether text can serve as an abstraction for image clustering. Current image clustering methods, however, neglect generated textual descriptions. We therefore propose Text-Guided Image Clustering, i.e., generating text using image captioning and visual question-answering (VQA) models and subsequently clustering the generated text. Further, we introduce a novel approach to inject task or domain knowledge for clustering by prompting VQA models. Across eight diverse image clustering datasets, our results show that the obtained text representations often outperform image features. Additionally, we propose a counting-based cluster explainability method. Our evaluations show that the derived keyword-based explanations describe clusters better than the respective cluster accuracy suggests. Overall, this research challenges traditional approaches and paves the way for a paradigm shift in image clustering using generated text.
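The pipeline described in the abstract (generate text per image, cluster the text, explain clusters by counting keywords) can be illustrated with a minimal, standard-library-only sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the captions are hand-written stand-ins for BLIP-2 output, the stopword list is a toy one, the TF-IDF weighting uses a common smoothed IDF, the clustering is a small spherical k-means with fixed initial centroids, and `explain_clusters` is one plausible reading of a "counting-based" explanation.

```python
import math
from collections import Counter, defaultdict

# Toy stopword list (an assumption; a real pipeline would use a fuller one).
STOPWORDS = {"a", "an", "the", "on", "in", "of"}

def tokenize(text):
    """Lowercase, split on whitespace, drop stopwords."""
    return [tok for tok in text.lower().split() if tok not in STOPWORDS]

def tfidf_vectors(docs):
    """Return L2-normalised sparse TF-IDF vectors (one dict per document)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        vec = {
            t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
            for t, c in Counter(toks).items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def cosine(a, b):
    """Dot product of two L2-normalised sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def spherical_kmeans(vectors, init_indices, n_iter=10):
    """k-means on cosine similarity; initial centroids are fixed for determinism."""
    centroids = [dict(vectors[i]) for i in init_indices]
    labels = []
    for _ in range(n_iter):
        # Assignment step: each vector goes to its most similar centroid.
        labels = [
            max(range(len(centroids)), key=lambda k: cosine(v, centroids[k]))
            for v in vectors
        ]
        # Update step: each centroid becomes the normalised sum of its members.
        for k in range(len(centroids)):
            total = defaultdict(float)
            for v, label in zip(vectors, labels):
                if label == k:
                    for t, w in v.items():
                        total[t] += w
            norm = math.sqrt(sum(w * w for w in total.values()))
            if norm > 0:
                centroids[k] = {t: w / norm for t, w in total.items()}
    return labels

def explain_clusters(docs, labels, top_n=2):
    """Describe each cluster by its most frequent non-stopword tokens."""
    counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        counts[label].update(tokenize(doc))
    return {label: [w for w, _ in c.most_common(top_n)] for label, c in counts.items()}

# Hand-written stand-ins for generated captions (hypothetical data).
captions = [
    "a bird on a branch",
    "a bird in a tree",
    "a small bird on a branch",
    "a jet in the sky",
    "a jet on a runway",
    "a large jet in the sky",
]

vectors = tfidf_vectors(captions)
labels = spherical_kmeans(vectors, init_indices=[0, 3])
explanations = explain_clusters(captions, labels)
print(labels)
print(explanations)
```

In this toy setting the two groups share no content words, so the clustering is trivially clean; with real generated captions, overlapping vocabulary (for example, shared background descriptions) would make the choice of representation, such as TF-IDF versus SBERT embeddings, matter considerably more.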

Paper Structure

This paper contains 19 sections, 7 figures, 8 tables, 1 algorithm.

Figures (7)

  • Figure 1: A t-SNE visualization of the BLIP-2 image embeddings for the STL10 dataset. While the images are highly similar (blue background), text such as bird and jet clearly distinguishes objects (and clusters).
  • Figure 2: Taxonomy of the text generation processes, structured by the degree of external knowledge. Text is generated with BLIP-2 (Li et al., 2023).
  • Figure 3: Effect of the number of captions sampled per image for BLIP-2. The number of captions is depicted on the X-axis, mean and standard deviation of clustering performance are on the Y-axis.
  • Figure 4: Confusion matrices for three clustering results, each based on text generated with a different VQA prompt. Although the three achieve similar cluster accuracy, the clustering reflects the prompt. In the middle, all room categories are clustered well; on the right, the clustering does not distinguish well between dining room, kitchen, and restaurant (see the corresponding dining room row) but achieves better overall accuracy.
  • Figure 5: Comparison of all strategies used. The questions for prompt-guided clustering are listed in Table \ref{tab:full_questions}.
  • ...and 2 more figures