Organizing Unstructured Image Collections using Natural Language
Mingxuan Liu, Zhun Zhong, Jun Li, Gianni Franchi, Subhankar Roy, Elisa Ricci
TL;DR
This paper introduces Open-ended Semantic Multiple Clustering (OpenSMC), a fully automatic, open-ended clustering paradigm for unstructured image collections. The proposed $\mathcal{X}$-Cluster framework uses a Criteria Proposer and a Semantic Grouper to generate natural-language clustering criteria and the corresponding semantic clusters by reasoning over textual representations produced from the images, without human priors or fixed cluster counts. The authors introduce the COCO-4c and Food-4c benchmarks to evaluate criterion discovery and semantic grouping, and demonstrate two applications: uncovering biases in text-to-image models and analyzing the visual factors that drive social media virality. Across six benchmarks, $\mathcal{X}$-Cluster outperforms several criterion-conditioned clustering baselines and shows promise for bias auditing and trend analysis, with a training-free design and open-source releases. Potential future directions include bias mitigation, fine-grained criteria, and extension to other data modalities.
Abstract
In this work, we introduce and study the novel task of Open-ended Semantic Multiple Clustering (OpenSMC). Given a large, unstructured image collection, the goal is to automatically discover several diverse semantic clustering criteria (e.g., Activity or Location) from the images, and subsequently organize the images according to the discovered criteria, without requiring any human input. Our framework, X-Cluster: eXploratory Clustering, treats text as a reasoning proxy: it concurrently scans the entire image collection, proposes candidate criteria in natural language, and groups images into meaningful clusters per criterion. This differs radically from previous works, which assume either predefined clustering criteria or fixed cluster counts. To evaluate X-Cluster, we create two new benchmarks, COCO-4C and Food-4C, each annotated with four distinct grouping criteria and corresponding cluster labels. Experiments show that X-Cluster can effectively reveal meaningful partitions on several datasets. Finally, we apply X-Cluster to several real-world applications, including uncovering hidden biases in text-to-image (T2I) generative models and analyzing image virality on social media. Code and datasets will be open-sourced for future research.
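To make the "text as a reasoning proxy" idea concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes each image has already been turned into a caption, and it replaces the language-model reasoning with a small hand-written lexicon. The names `LEXICON`, `propose_criteria`, `group_images`, and `organize`, as well as the example captions, are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Toy lexicon standing in for the knowledge a language model would bring.
# Hypothetical: the actual framework proposes criteria and cluster names
# in an open-ended way rather than from a fixed table like this one.
LEXICON = {
    "Activity": {"surfing", "cooking"},
    "Location": {"beach", "kitchen"},
    "Weather": {"snowy", "rainy"},
}


def propose_criteria(captions):
    """Keep only the criteria that are supported by at least one caption."""
    words = {w for text in captions.values() for w in text.lower().split()}
    return [name for name, vocab in LEXICON.items() if vocab & words]


def group_images(captions, criterion):
    """Assign each image to a named cluster under the given criterion."""
    clusters = defaultdict(list)
    for image_id, text in captions.items():
        tokens = set(text.lower().split())
        matches = sorted(LEXICON[criterion] & tokens)
        # Images with no matching word fall into a catch-all cluster.
        label = matches[0] if matches else "other"
        clusters[label].append(image_id)
    return dict(clusters)


def organize(captions):
    """Discover criteria from the captions, then cluster once per criterion."""
    return {c: group_images(captions, c) for c in propose_criteria(captions)}


# Captions play the role of the textual proxy for the images (made-up data).
captions = {
    "img1.jpg": "a man surfing at the beach",
    "img2.jpg": "a woman cooking in the kitchen",
    "img3.jpg": "children surfing at the beach",
    "img4.jpg": "a chef cooking in the kitchen",
}

result = organize(captions)
for criterion, clusters in result.items():
    print(criterion, clusters)
```

The point of the sketch is the shape of the output: the same collection receives one partition per discovered criterion, and neither the criteria nor the number of clusters is supplied by the caller. A criterion with no support in the captions (here, Weather) is simply never proposed.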
