Image Clustering Conditioned on Text Criteria

Sehyun Kwon, Jaeseung Park, Minkyu Kim, Jaewoong Cho, Ernest K. Ryu, Kangwook Lee

TL;DR

This work presents a new methodology for image clustering based on user-specified text criteria, leveraging modern vision-language models and large language models. The method, IC|TC, can effectively cluster images by various criteria, such as human action, physical location, or the person's mood, while significantly outperforming baselines.

Abstract

Classical clustering methods do not provide users with direct control of the clustering results, and the clustering results may not be consistent with the relevant criterion that a user has in mind. In this work, we present a new methodology for performing image clustering based on user-specified text criteria by leveraging modern vision-language models and large language models. We call our method Image Clustering Conditioned on Text Criteria (IC|TC), and it represents a different paradigm of image clustering. IC|TC requires a minimal and practical degree of human intervention and grants the user significant control over the clustering results in return. Our experiments show that IC|TC can effectively cluster images with various criteria, such as human action, physical location, or the person's mood, while significantly outperforming baselines.

Paper Structure (60 sections, 2 equations, 17 figures, 23 tables, 4 algorithms)

Figures (17)

  • Figure 1: Sample images from clustering results of IC|TC. The method finds clusters consistent with the user-specified text criterion. Furthermore, IC|TC provides cluster names (texts above each image cluster) along with the clusters, enhancing the interpretability of clustering results.
  • Figure 1: Clustering with varying text criteria. Accuracies labeled with * are evaluated by having a human provide ground truth labels for 1000 randomly sampled images. In this experiment, we used LLaVA as the VLM and GPT-4 as the LLM.
  • Figure 2: The IC|TC method. (Step 1) Vision-language model (VLM) extracts detailed relevant textual descriptions of images. (Step 2) Large language model (LLM) identifies the names of the clusters. (Step 3) LLM conducts clustering by assigning each description to the appropriate cluster. The entire procedure is guided by a user-specified text criterion (TC). (Optional TC Refinement) The user can update the text criterion if the clustering results are unsatisfactory. See the appendix for an unabridged sample output.
  • Figure 3: Effect of LLM selection.
  • Figure 4: Example illustrating why the cluster assignment of Step 3 requires the full description of the image.
  • ...and 12 more figures
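
The three steps named in the Figure 2 caption can be sketched as a small pipeline. This is only an illustration of the control flow, not the authors' implementation: the function `ictc_cluster` and the three callables `describe`, `propose_names`, and `assign` are hypothetical names, and the toy stand-ins below replace the real VLM and LLM calls with simple string matching so that the example runs on its own.

```python
from typing import Callable, Dict, List


def ictc_cluster(
    images: List[str],
    criterion: str,
    describe: Callable[[str, str], str],
    propose_names: Callable[[List[str], str], List[str]],
    assign: Callable[[str, List[str], str], str],
) -> Dict[str, List[str]]:
    """Hypothetical sketch of the three-step flow, guided by a text criterion."""
    # Step 1: a VLM (here: any callable) describes each image w.r.t. the criterion.
    descriptions = [describe(image, criterion) for image in images]
    # Step 2: an LLM (here: any callable) identifies the cluster names.
    names = propose_names(descriptions, criterion)
    # Step 3: each full description is assigned to one of the cluster names.
    clusters: Dict[str, List[str]] = {name: [] for name in names}
    for image, description in zip(images, descriptions):
        clusters.setdefault(assign(description, names, criterion), []).append(image)
    return clusters


# Toy stand-ins (assumptions for illustration only; no model is called).
CAPTIONS = {
    "a.jpg": "a man riding a bike in a park",
    "b.jpg": "a woman cooking in a kitchen",
    "c.jpg": "a child riding a horse on a beach",
}
ACTIONS = ["riding", "cooking"]


def toy_describe(image: str, criterion: str) -> str:
    # Looks up a canned caption instead of querying a VLM.
    return CAPTIONS[image]


def toy_propose_names(descriptions: List[str], criterion: str) -> List[str]:
    # Keeps the known action words that occur in at least one description.
    return [a for a in ACTIONS if any(a in d for d in descriptions)]


def toy_assign(description: str, names: List[str], criterion: str) -> str:
    # Picks the first cluster name mentioned in the description.
    return next(n for n in names if n in description)


result = ictc_cluster(
    list(CAPTIONS), "action", toy_describe, toy_propose_names, toy_assign
)
print(result)
```

Changing the criterion and swapping in different callables, for example ones keyed on location words instead of action words, would regroup the same images, which mirrors the optional TC refinement loop described in the caption.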