Image Clustering Conditioned on Text Criteria

Sehyun Kwon; Jaeseung Park; Minkyu Kim; Jaewoong Cho; Ernest K. Ryu; Kangwook Lee

TL;DR

This work presents a new methodology for image clustering based on user-specified text criteria, leveraging modern vision-language models and large language models. The method, Image Clustering Conditioned on Text Criteria (IC|TC), effectively clusters images under various criteria, such as human action, physical location, or the person's mood, while significantly outperforming baselines.

Abstract

Classical clustering methods do not provide users with direct control of the clustering results, and the clustering results may not be consistent with the relevant criterion that a user has in mind. In this work, we present a new methodology for performing image clustering based on user-specified text criteria by leveraging modern vision-language models and large language models. We call our method Image Clustering Conditioned on Text Criteria (IC|TC), and it represents a different paradigm of image clustering. IC|TC requires a minimal and practical degree of human intervention and grants the user significant control over the clustering results in return. Our experiments show that IC|TC can effectively cluster images with various criteria, such as human action, physical location, or the person's mood, while significantly outperforming baselines.

Paper Structure (60 sections, 2 equations, 17 figures, 23 tables, 4 algorithms)

Introduction
Contribution
Task definition: Image clustering conditioned on iteratively refined text criteria
Iterative refinement of text criteria.
Comparison with classical clustering.
Comparison with zero-shot classification.
IC$|$TC: Image Clustering Conditioned on Text Criteria
Step 1: Extract salient features from the image
Step 2: Obtaining cluster names
Step 3: Clustering by assigning images
Iteratively editing the algorithm through text prompt engineering
Producing cluster labels
Experiments
Clustering with varying text criteria
Clustering with varying granularity
...and 45 more sections

Figures (17)

Figure 1: Sample images from clustering results of IC$|$TC. The method finds clusters consistent with the user-specified text criterion. Furthermore, IC$|$TC provides cluster names (texts above each image cluster) along with the clusters, enhancing the interpretability of clustering results.
Figure 1: Clustering with varying text criteria. Accuracies labeled with * are evaluated by having a human provide ground truth labels for 1000 randomly sampled images. In this experiment, we used LLaVA for VLM and GPT-4 for LLM.
Figure 2: The IC$|$TC method. (Step 1) A vision-language model (VLM) extracts detailed, relevant textual descriptions of the images. (Step 2) A large language model (LLM) identifies the names of the clusters. (Step 3) The LLM conducts clustering by assigning each description to the appropriate cluster. The entire procedure is guided by a user-specified text criterion ($\mathbf{TC}$). (Optional $\mathbf{TC}$ refinement) The user can update the text criterion if the clustering results are unsatisfactory. See the appendix for an unabridged sample output.
Figure 3: Effect of LLM selection.
Figure 4: Example illustrating why the cluster assignment of Step 3 requires the full description of the image.
...and 12 more figures
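
The three steps described in the Figure 2 caption can be sketched as a small pipeline. This is a minimal illustration, not the authors' implementation: the function `ic_tc`, its parameters, and the three toy callables are hypothetical names introduced here, and the toy callables use fixed captions and keyword matching in place of a real VLM (such as LLaVA) and LLM (such as GPT-4). Passing the number of clusters `k` explicitly is also an assumption of this sketch.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple


def ic_tc(
    images: List[str],
    criterion: str,
    k: int,
    describe: Callable[[str, str], str],
    name_clusters: Callable[[List[str], str, int], List[str]],
    assign: Callable[[str, List[str], str], str],
) -> Tuple[List[str], Dict[str, str]]:
    """Hypothetical three-step pipeline guided by a text criterion."""
    # Step 1: a VLM-like callable turns each image into a text description.
    descriptions = {img: describe(img, criterion) for img in images}
    # Step 2: an LLM-like callable proposes k cluster names.
    names = name_clusters(list(descriptions.values()), criterion, k)
    # Step 3: an LLM-like callable assigns each description to one name.
    assignments = {
        img: assign(desc, names, criterion) for img, desc in descriptions.items()
    }
    return names, assignments


# Toy stand-ins (assumptions): fixed captions and keyword matching.
CAPTIONS = {
    "a.jpg": "a person riding a bike in a park",
    "b.jpg": "a man cooking pasta in a kitchen",
    "c.jpg": "a child riding a horse on a farm",
    "d.jpg": "a woman cooking soup at home",
}
ACTION_WORDS = ["riding", "cooking", "reading"]


def toy_describe(image: str, criterion: str) -> str:
    # A real VLM would be prompted with the image and the criterion.
    return CAPTIONS[image]


def toy_name_clusters(descriptions: List[str], criterion: str, k: int) -> List[str]:
    # Count candidate action words and keep the k most frequent ones.
    counts = Counter(w for d in descriptions for w in ACTION_WORDS if w in d)
    return [word for word, _ in counts.most_common(k)]


def toy_assign(description: str, names: List[str], criterion: str) -> str:
    # Pick the first cluster name mentioned in the description.
    for name in names:
        if name in description:
            return name
    return names[0]  # fallback when no name matches


names, clusters = ic_tc(
    list(CAPTIONS), "human action", 2, toy_describe, toy_name_clusters, toy_assign
)
print(names)
print(clusters)
```

Keeping the three steps as separate callables mirrors the optional refinement loop in the caption: if the clusters are unsatisfactory, the user changes `criterion` and reruns the pipeline without touching the rest of the code.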
