ImageSet2Text: Describing Sets of Images through Text

Piera Riccio, Francesco Galati, Kajetan Schweighofer, Noa Garcia, Nuria Oliver

TL;DR

ImageSet2Text describes large image collections by combining an iterative VQA loop with an external lexical graph and CLIP-based verification to build a growing concept graph. The method mixes data-driven and symbolic reasoning: it uses a small random subset to hypothesize concepts, expands them through a lexical hierarchy, and verifies them across the full set before updating the graph and generating a final description. It is evaluated on two new group-captioning benchmarks, on set-difference captioning, and in a user study, where it shows higher accuracy, completeness, and user satisfaction than strong baselines, and ablations justify the hybrid design. The work demonstrates practical relevance for dataset understanding, bias detection, accessibility, and explainable AI, and also details the scalability properties and the limitations of relying on WordNet and CLIP within a multi-model pipeline.

Abstract

In the era of large-scale visual data, understanding collections of images is a challenging yet important task. To this end, we introduce ImageSet2Text, a novel method to automatically generate natural language descriptions of image sets. Based on large language models, visual-question answering chains, an external lexical graph, and CLIP-based verification, ImageSet2Text iteratively extracts key concepts from image subsets and organizes them into a structured concept graph. We conduct extensive experiments evaluating the quality of the generated descriptions in terms of accuracy, completeness, and user satisfaction. We also examine the method's behavior through ablation studies, scalability assessments, and failure analyses. Results demonstrate that ImageSet2Text combines data-driven AI and symbolic representations to reliably summarize large image collections for a wide range of applications.
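
The iterative loop described above (hypothesize concepts on a random subset, expand them through a lexical hierarchy, verify them on the full set, update the graph) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: images are represented as sets of tags, `HYPERNYM` is a hand-written stand-in for the external lexical graph, `hypothesize` replaces the VQA step, and `verify` replaces CLIP-based verification with a simple support threshold. It is not the authors' implementation.

```python
import random

# Hypothetical toy hypernym map standing in for the external lexical graph.
HYPERNYM = {"dog": "animal", "cat": "animal", "animal": "entity"}

def lineage(concept):
    """Return the concept followed by its ancestors in the toy hierarchy."""
    chain = [concept]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain

def hypothesize(subset):
    """Stand-in for the VQA step: collect every tag seen in the subset."""
    return sorted(set().union(*subset))

def verify(images, concept, threshold):
    """Stand-in for CLIP-based verification: share of images showing the concept."""
    hits = sum(any(concept in lineage(tag) for tag in img) for img in images)
    return hits / len(images) >= threshold

def build_concept_graph(images, subset_size=2, rounds=5, threshold=0.7, seed=0):
    """Grow a concept graph: verified concept -> its hypernym (None if it has none)."""
    rng = random.Random(seed)
    graph = {}
    for _ in range(rounds):
        # Hypothesize on a small random subset only.
        subset = rng.sample(images, min(subset_size, len(images)))
        for tag in hypothesize(subset):
            # Expand each hypothesis through the hierarchy, then verify on the full set.
            for candidate in lineage(tag):
                if candidate not in graph and verify(images, candidate, threshold):
                    graph[candidate] = HYPERNYM.get(candidate)
    return graph

# Hypothetical example data: four images, each a set of tags.
images = [{"dog", "grass"}, {"dog", "park"}, {"cat", "grass"}, {"dog", "grass"}]
graph = build_concept_graph(images, subset_size=4, rounds=1)
print(sorted(graph))
```

In this toy setting, concepts that hold for only a minority of images (here `cat` and `park`) are hypothesized from the subset but rejected by the full-set check, which is the role the verification step plays in the description above.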

Paper Structure

This paper contains 55 sections, 1 equation, 15 figures, 9 tables.

Figures (15)

  • Figure 1: ImageSet2Text generates detailed and nuanced descriptions from large sets of images. We report exemplary generated descriptions for two groups of images [sharma2018, artgan2018].
  • Figure 2: ImageSet2Text's evaluation considers three key properties of the descriptions (accuracy, completeness, and user satisfaction) and analyzes the method's behavior through ablations, scalability estimates, and failure-case analysis, showing great versatility for potential applications.
  • Figure 3: Overview of ImageSet2Text, considering an example set from the PairedImageSets dataset [visdiff]. The figure shows how the different modules of the iterative process allow inferring information from the input image set, eventually generating a nuanced textual description.
  • Figure 4: Accuracy as average rank across seven metrics on GroupConceptualCaptions (left) and GroupWikiArt (right). Only the top ten methods are shown; full results are in the appendix (detailed results).
  • Figure 5: User study results. For the control descriptions, the clarity, accuracy, detail, and flow values come from the descriptions designed to assess each of those properties, respectively, whereas overall and satisfaction are averaged across all control descriptions. Purple bars indicate medians; dots show means, with the actual values reported in the figure.
  • ...and 10 more figures
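
The "average rank across seven metrics" used in Figure 4 can be illustrated with a short sketch. It assumes that higher is better for every metric and that ties are broken by input order; neither detail is stated here, and the method names and scores below are made up for illustration.

```python
def average_ranks(scores):
    """Average each method's rank over all metrics.

    scores: {method: [value per metric]}; higher is assumed to be better.
    Ties are broken by input order because Python's sort is stable.
    """
    methods = list(scores)
    n_metrics = len(next(iter(scores.values())))
    totals = {m: 0.0 for m in methods}
    for j in range(n_metrics):
        # Rank 1 goes to the best-scoring method on metric j.
        ordered = sorted(methods, key=lambda m: scores[m][j], reverse=True)
        for rank, m in enumerate(ordered, start=1):
            totals[m] += rank
    return {m: totals[m] / n_metrics for m in methods}

# Hypothetical scores for three methods on two metrics.
example = {"A": [0.9, 0.8], "B": [0.5, 0.9], "C": [0.1, 0.2]}
print(average_ranks(example))
```

A lower average rank means a method places nearer the top across the metrics, regardless of the scale of each individual metric.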