ImageSet2Text: Describing Sets of Images through Text
Piera Riccio, Francesco Galati, Kajetan Schweighofer, Noa Garcia, Nuria Oliver
TL;DR
ImageSet2Text tackles the problem of describing large image collections by integrating an iterative VQA loop with an external lexical graph and CLIP-based verification to construct a growing concept graph. The method combines data-driven and symbolic reasoning: it uses a small random subset to hypothesize concepts, expands them through a lexical hierarchy, and verifies them across the full set before updating the graph and generating a final description. It is evaluated on two new benchmarks for group-captioning, set-difference captioning, and a user study, showing superior accuracy, completeness, and user satisfaction compared to strong baselines, with ablations that justify the hybrid design. The work demonstrates practical impact for dataset understanding, bias detection, accessibility, and explainable AI, while also detailing scalability properties and the limitations of relying on WordNet and CLIP within a multi-model pipeline.
Abstract
In the era of large-scale visual data, understanding collections of images is a challenging yet important task. To this end, we introduce ImageSet2Text, a novel method to automatically generate natural language descriptions of image sets. Based on large language models, visual question answering (VQA) chains, an external lexical graph, and CLIP-based verification, ImageSet2Text iteratively extracts key concepts from image subsets and organizes them into a structured concept graph. We conduct extensive experiments evaluating the quality of the generated descriptions in terms of accuracy, completeness, and user satisfaction. We also examine the method's behavior through ablation studies, scalability assessments, and failure analyses. Results demonstrate that ImageSet2Text combines data-driven AI and symbolic representations to reliably summarize large image collections for a wide range of applications.
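The hypothesize, expand, verify, update loop summarized above can be illustrated with a toy sketch. This is not the authors' implementation: images are replaced by sets of tags, and every name here (`HYPONYMS`, `propose`, `verify`, `describe`, the 0.8 threshold, the subset size) is a hypothetical stand-in, with a dictionary lookup in place of the lexical graph, a tag check in place of VQA, and a support ratio in place of CLIP-based verification.

```python
import random

# Toy lexical hierarchy (assumed stand-in for an external lexical graph):
# each concept maps to its more specific children.
HYPONYMS = {
    "entity": ["animal", "vehicle"],
    "animal": ["dog", "cat"],
    "dog": ["puppy"],
}

def propose(node, subset):
    """Stand-in for VQA on a small subset: refinements of `node` seen in the subset."""
    return [c for c in HYPONYMS.get(node, []) if any(c in tags for tags in subset)]

def verify(concept, images, threshold=0.8):
    """Stand-in for CLIP-based verification: share of the full set supporting `concept`."""
    support = sum(concept in tags for tags in images) / len(images)
    return support >= threshold

def describe(images, subset_size=3, seed=0, threshold=0.8):
    """Grow a concept graph (parent -> verified children) starting from a root node."""
    rng = random.Random(seed)
    graph = {"entity": []}
    frontier = ["entity"]
    while frontier:
        node = frontier.pop()
        # Hypothesize on a small random subset, then verify on the full set.
        subset = rng.sample(images, min(subset_size, len(images)))
        for candidate in propose(node, subset):
            if verify(candidate, images, threshold):
                graph[node].append(candidate)
                graph[candidate] = []
                frontier.append(candidate)
    return graph

# Five toy "images": all show a dog, only two show a puppy.
IMAGES = [
    {"animal", "dog"},
    {"animal", "dog", "puppy"},
    {"animal", "dog"},
    {"animal", "dog", "puppy"},
    {"animal", "dog"},
]

if __name__ == "__main__":
    print(describe(IMAGES))
```

In this toy run, "animal" and "dog" are supported by the whole set and enter the graph, while "puppy" may be hypothesized from the subset but is rejected at verification because only two of the five images support it. A final description would then be generated from the verified graph, a step this sketch omits.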
