Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions

Yu-Guan Hsieh, Cheng-Yu Hsieh, Shih-Ying Yeh, Louis Béthune, Hadi Pour Ansari, Pavan Kumar Anasosalu Vasu, Chun-Liang Li, Ranjay Krishna, Oncel Tuzel, Marco Cuturi

TL;DR

This work introduces Graph-Based Captioning (GBC), a graph-structured, text-rich image annotation framework with four node types (image, entity, composition, relation) that capture hierarchical and relational content. It proposes an end-to-end workflow that generates GBC annotations automatically at scale, yielding two large datasets (GBC1M and GBC10M) built on CC12M and released under CC BY-NC 4.0. CLIP models trained on GBC annotations outperform those trained on traditional captions across benchmarks, particularly when composition and relation nodes are used. To exploit the graph information, the paper introduces a structure-aware attention mechanism and a multi-positive contrastive loss. The authors also use GBC as middleware for text-to-image generation, where conditioning generation on the graph structure gives finer-grained control and better alignment with user intent.
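To make the idea of a multi-positive contrastive loss concrete, here is a minimal sketch in plain Python. It is a generic image-to-text formulation, not the paper's exact loss: each image may have several positive captions (for example, captions from several nodes of its graph), and the loss averages the negative log-softmax probability over those positives. The function name, the `positives` index lists, and the temperature value are illustrative assumptions, and the embeddings are assumed to be non-zero vectors.

```python
import math
from typing import List, Sequence


def _normalize(vec: Sequence[float]) -> List[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def multi_positive_contrastive_loss(
    image_embs: Sequence[Sequence[float]],
    text_embs: Sequence[Sequence[float]],
    positives: Sequence[Sequence[int]],
    temperature: float = 0.07,  # assumed value, not taken from the paper
) -> float:
    """Image-to-text contrastive loss where image i has positives[i] as matching texts."""
    images = [_normalize(v) for v in image_embs]
    texts = [_normalize(v) for v in text_embs]
    total = 0.0
    for img, pos in zip(images, positives):
        if not pos:
            raise ValueError("every image needs at least one positive text")
        # Cosine similarity between this image and every text, scaled by temperature.
        logits = [sum(a * b for a, b in zip(img, t)) / temperature for t in texts]
        # Numerically stable log of the softmax normaliser.
        top = max(logits)
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        # Average the negative log-probability over all positives of this image.
        total += -sum(logits[j] - log_z for j in pos) / len(pos)
    return total / len(images)


# Toy example: two images, three captions; image 0 matches captions 0 and 1.
image_embs = [[1.0, 0.0], [0.0, 1.0]]
text_embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
loss_matched = multi_positive_contrastive_loss(image_embs, text_embs, [[0, 1], [2]])
loss_swapped = multi_positive_contrastive_loss(image_embs, text_embs, [[2], [0, 1]])
```

With correctly assigned positives the loss is lower than with swapped ones. A symmetric text-to-image term could be added in the same way.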

Abstract

Humans describe complex scenes with compositionality, using simple text descriptions enriched with links and relationships. While vision-language research has aimed to develop models with compositional understanding capabilities, this is not yet reflected in existing datasets, which for the most part still use plain text to describe images. In this work, we propose a new annotation strategy, graph-based captioning (GBC), that describes an image using a labeled graph structure with nodes of various types. The nodes in GBC are created through a two-stage process: first, identifying and describing entity nodes; second, linking these nodes by highlighting compositions and relations among them. Since all GBC nodes hold plain text descriptions, GBC retains the flexibility found in natural language, but can also encode hierarchical information in its edges. We demonstrate that GBC can be produced automatically, using off-the-shelf multimodal LLMs and object detection models, by building a new dataset, GBC10M, that gathers GBC annotations for about 10M images of the CC12M dataset. Through CLIP training on GBC10M, we show that leveraging GBC nodes' annotations, particularly those in composition and relation nodes, significantly boosts the model's performance across various benchmarks compared to when other annotations are used. To further explore the opportunities provided by GBC, we also investigate the use of GBC as middleware for text-to-image generation, and show the extra benefits of incorporating the graph structure in this task. Our code and datasets are released at https://github.com/apple/ml-gbc and https://huggingface.co/graph-based-captions.
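The graph structure described in the abstract can be sketched as a small data structure. This is a minimal illustration under stated assumptions, not the schema of the released datasets: the class names (`Node`, `CaptionGraph`), the field names, and the toy captions are all hypothetical. It shows the four node types, a plain-text caption on every node, and labeled edges that encode hierarchy.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# The four node types named in the paper.
NODE_TYPES = ("image", "entity", "composition", "relation")


@dataclass
class Node:
    node_id: str
    node_type: str  # one of NODE_TYPES
    caption: str    # every node holds a plain-text description
    # Outgoing edges as (label, child_id); the label is a text span of the caption.
    edges: List[Tuple[str, str]] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.node_type not in NODE_TYPES:
            raise ValueError(f"unknown node type: {self.node_type}")


@dataclass
class CaptionGraph:
    nodes: Dict[str, Node] = field(default_factory=dict)

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, parent_id: str, label: str, child_id: str) -> None:
        if child_id not in self.nodes:
            raise KeyError(f"unknown child node: {child_id}")
        self.nodes[parent_id].edges.append((label, child_id))

    def depth(self, node_id: str) -> int:
        """Longest number of edges from node_id to a leaf (assumes no cycles)."""
        children = self.nodes[node_id].edges
        if not children:
            return 0
        return 1 + max(self.depth(child_id) for _, child_id in children)


# Toy graph: an image node, two entity nodes, and one relation node linking them.
graph = CaptionGraph()
graph.add_node(Node("img", "image", "A dog sleeps on a sofa."))
graph.add_node(Node("dog", "entity", "A small brown dog with floppy ears."))
graph.add_node(Node("sofa", "entity", "A grey fabric sofa."))
graph.add_node(Node("rel", "relation", "The dog is lying on the sofa."))
graph.add_edge("img", "dog", "dog")
graph.add_edge("img", "sofa", "sofa")
graph.add_edge("img", "dog on sofa", "rel")
graph.add_edge("rel", "dog", "dog")
graph.add_edge("rel", "sofa", "sofa")
```

Here `graph.depth("img")` is 2, because the longest path runs from the image node through the relation node to an entity node.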

Paper Structure (65 sections, 3 equations, 42 figures, 19 tables)

Figures (42)

  • Figure 1: An illustration of our proposed graph-based captions. The image node, entity nodes, composition nodes, and relation nodes are colored in red, blue, green, and yellow, respectively. The colored text in the captions corresponds to the labels of the outgoing edges, which are summarized as node labels in the figure. More examples are provided in the appendix on dataset examples.
  • Figure 2: Our image annotation process involves four types of queries that are performed in two separate stages, with the detection model serving to single out the regions used for the different queries.
  • Figure 3: An example of a graph generated by our 200M prompt generation model.
  • Figure 4: Images generated using GBC prompts with different algorithms. Some algorithms use only a strict subset of GBC information. We note that although more advanced methods for generating images from region prompts exist, our goal here is to highlight how incorporating additional graph information can enhance a simple, training-free approach that might otherwise perform poorly when only bounding box information is exploited. Image prompts are provided for the second example using IP-Adapter (Ye et al., 2023). The method that leverages only the prompts and the graph does not work for the third example because the depth of the corresponding graph is greater than 1.
  • Figure 5: The system prompt used for image query (first half).
  • ...and 37 more figures