Organizing Unstructured Image Collections using Natural Language

Mingxuan Liu, Zhun Zhong, Jun Li, Gianni Franchi, Subhankar Roy, Elisa Ricci

TL;DR

OpenSMC introduces a fully automatic, open-ended clustering paradigm for unstructured image collections. The proposed $\mathcal{X}$-Cluster framework uses a Criteria Proposer and a Semantic Grouper to generate natural-language clustering criteria and their semantic clusters by reasoning over textual representations produced from images, without human priors or fixed cluster counts. The authors introduce COCO-4c and Food-4c benchmarks to evaluate criterion discovery and semantic grouping, and demonstrate applications to uncover biases in text-to-image models and to analyze visual factors driving social media virality. Across six benchmarks, $\mathcal{X}$-Cluster outperforms several criterion-conditioned clustering baselines and shows promise for bias audit and trend analysis, with a training-free design and open-source releases. Potential directions include bias mitigation, fine-grained criteria, and extension to other data modalities.

Abstract

In this work, we introduce and study the novel task of Open-ended Semantic Multiple Clustering (OpenSMC). Given a large, unstructured image collection, the goal is to automatically discover several diverse semantic clustering criteria (e.g., Activity or Location) from the images, and subsequently organize the images according to the discovered criteria, without requiring any human input. Our framework, X-Cluster: eXploratory Clustering, treats text as a reasoning proxy: it concurrently scans the entire image collection, proposes candidate criteria in natural language, and groups images into meaningful clusters per criterion. This differs radically from previous work, which assumes either predefined clustering criteria or fixed cluster counts. To evaluate X-Cluster, we create two new benchmarks, COCO-4c and Food-4c, each annotated with four distinct grouping criteria and corresponding cluster labels. Experiments show that X-Cluster can effectively reveal meaningful partitions on several datasets. Finally, we apply X-Cluster to real-world use cases, including uncovering hidden biases in text-to-image (T2I) generative models and analyzing image virality on social media. Code and datasets will be open-sourced for future research.
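
The two-stage flow described in the abstract (propose criteria, then group images per criterion) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the paper reasons over text with large models, whereas the sketch below substitutes simple keyword matching so that it runs without dependencies. The function names (`propose_criteria`, `group_by_criterion`, `organize`), the cue-word vocabulary, and the captions are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def propose_criteria(captions: List[str], vocabulary: Dict[str, List[str]]) -> List[str]:
    """Toy stand-in for a Criteria Proposer (assumption: keyword matching).

    A criterion is kept if at least one of its cue words occurs in any caption.
    """
    text = " ".join(captions).lower()
    return [name for name, cues in vocabulary.items() if any(cue in text for cue in cues)]


def group_by_criterion(captions: Dict[str, str], cues: List[str]) -> Dict[str, List[str]]:
    """Toy stand-in for a Semantic Grouper (assumption: keyword matching).

    Each image is assigned to the first cue word found in its caption,
    or to "other" if no cue word matches.
    """
    clusters = defaultdict(list)
    for image_id, caption in captions.items():
        label = next((cue for cue in cues if cue in caption.lower()), "other")
        clusters[label].append(image_id)
    return dict(clusters)


def organize(captions: Dict[str, str], vocabulary: Dict[str, List[str]]) -> Dict[str, Dict[str, List[str]]]:
    """Discover criteria from all captions, then cluster the images once per criterion."""
    criteria = propose_criteria(list(captions.values()), vocabulary)
    return {name: group_by_criterion(captions, vocabulary[name]) for name in criteria}


# Hypothetical captions standing in for text produced from images.
captions = {
    "img1": "A man surfing at the beach",
    "img2": "Two people cooking in a kitchen",
    "img3": "A dog running on the beach",
}
# Hypothetical candidate criteria with cue words; "Mood" has no support in the captions.
vocabulary = {
    "Activity": ["surfing", "cooking", "running"],
    "Location": ["beach", "kitchen"],
    "Mood": ["happy", "sad"],
}

result = organize(captions, vocabulary)
for criterion, clusters in result.items():
    print(criterion, clusters)
```

The point of the sketch is the structure: the same collection yields one partition per discovered criterion, and neither the criteria nor the number of clusters is fixed ahead of time beyond the toy vocabulary used here.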

Paper Structure (45 sections, 3 equations, 24 figures, 35 tables)

This paper contains 45 sections, 3 equations, 24 figures, 35 tables.

Figures (24)

  • Figure 1: All three variants of the proposed $\bm{\mathcal{X}}$-Cluster framework. We explore different design choices for both the Criteria Proposer (left) and the Semantic Grouper (right), and designate the best-performing Caption-based system as the main $\mathcal{X}$-Cluster configuration in our experiments.
  • Figure 2: OpenSMC benchmarks. We introduce two new challenging benchmarks: COCO-4c and Food-4c. We show all annotated criteria and the corresponding labels for the example images.
  • Figure 2: Example predicted clusters of COCO-4c.
  • Figure 3: $\bm{\mathcal{X}}$-Cluster consists of a Criteria Proposer and a Semantic Grouper. (left) Given a set of images, the Proposer discovers and outputs a pool of grouping criteria in natural language. (right) For each criterion, the Grouper extracts criterion-specific descriptions from the images, discovers the underlying semantic clusters, and groups each image at three levels of semantic granularity. The results panel shows an example of how an unstructured image collection can be grouped into clusters of different semantic granularity for the criterion "Location". See Supp. \ref{sec:app_prompt} for implementation and prompt details.
  • Figure 3: Example predicted clusters of Food-4c.
  • ...and 19 more figures