Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions
Yu-Guan Hsieh, Cheng-Yu Hsieh, Shih-Ying Yeh, Louis Béthune, Hadi Pour Ansari, Pavan Kumar Anasosalu Vasu, Chun-Liang Li, Ranjay Krishna, Oncel Tuzel, Marco Cuturi
TL;DR
This work introduces Graph-Based Captioning (GBC), a graph-structured, text-rich image annotation framework with four node types (image, entity, composition, relation) that capture hierarchical and relational content. It proposes an end-to-end workflow that generates GBC annotations automatically at scale, yielding two large datasets (GBC1M and GBC10M) built on CC12M and released under CC BY-NC 4.0. In CLIP training, GBC annotations outperform traditional captions, particularly when composition and relation nodes are used. To exploit the graph information, the work also introduces a structure-aware attention mechanism and a multi-positive contrastive loss. Finally, the authors use GBC as middleware for text-to-image generation: conditioning generation on the graph structure gives finer-grained control and improves alignment with user intent, expanding the practical utility of vision-language models.
Abstract
Humans describe complex scenes with compositionality, using simple text descriptions enriched with links and relationships. While vision-language research has aimed to develop models with compositional understanding capabilities, this is not yet reflected in existing datasets, which for the most part still use plain text to describe images. In this work, we propose a new annotation strategy, graph-based captioning (GBC), that describes an image using a labeled graph structure with nodes of various types. The nodes in GBC are created through a two-stage process: first, identifying and describing entity nodes; second, linking these nodes by highlighting compositions and relations among them. Since all GBC nodes hold plain text descriptions, GBC retains the flexibility found in natural language, but can also encode hierarchical information in its edges. We demonstrate that GBC can be produced automatically, using off-the-shelf multimodal LLMs and object detection models, by building a new dataset, GBC10M, that gathers GBC annotations for about 10M images of the CC12M dataset. Through CLIP training on GBC10M, we show that leveraging GBC nodes' annotations, particularly those in composition and relation nodes, significantly boosts the model's performance across various benchmarks compared to when other annotations are used. To further explore the opportunities provided by GBC, we also investigate the use of GBC as middleware for text-to-image generation, and show the extra benefits of incorporating the graph structure in this task. Our code and datasets are released at https://github.com/apple/ml-gbc and https://huggingface.co/graph-based-captions.
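
To make the annotation structure concrete, the following is a minimal Python sketch of a labeled caption graph with the four node types and the two-stage construction described above. It is an illustration only: the class names (Node, CaptionGraph), the helper build_example_graph, the edge-label convention, and the example captions are all hypothetical, and none of them reflect the schema of the released datasets or the API of the ml-gbc code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# The four node types named in the text. Everything else in this sketch
# (class names, fields, edge labels, captions) is a hypothetical illustration.
NODE_TYPES = ("image", "entity", "composition", "relation")


@dataclass
class Node:
    node_id: str
    node_type: str
    caption: str  # every node holds a plain-text description

    def __post_init__(self) -> None:
        if self.node_type not in NODE_TYPES:
            raise ValueError(f"unknown node type: {self.node_type}")


@dataclass
class CaptionGraph:
    nodes: Dict[str, Node] = field(default_factory=dict)
    # Directed, labeled edges stored as (parent_id, child_id, label).
    edges: List[Tuple[str, str, str]] = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, parent_id: str, child_id: str, label: str) -> None:
        if parent_id not in self.nodes or child_id not in self.nodes:
            raise KeyError("both endpoints must be added before linking them")
        self.edges.append((parent_id, child_id, label))

    def children(self, node_id: str) -> List[Tuple[str, Node]]:
        """Return (edge label, child node) pairs for one parent."""
        return [(label, self.nodes[c]) for p, c, label in self.edges if p == node_id]

    def captions(self, node_type: str) -> List[str]:
        """Return the captions of all nodes of a given type, in insertion order."""
        return [n.caption for n in self.nodes.values() if n.node_type == node_type]


def build_example_graph() -> CaptionGraph:
    g = CaptionGraph()
    g.add_node(Node("image", "image", "A dog leaps to catch a frisbee in a park."))

    # Stage 1: identify and describe entity nodes.
    g.add_node(Node("dog", "entity", "A brown dog in mid-air."))
    g.add_node(Node("frisbee", "entity", "A red plastic frisbee."))
    g.add_edge("image", "dog", "dog")
    g.add_edge("image", "frisbee", "frisbee")

    # Stage 2: link entities through a relation node.
    g.add_node(Node("dog_frisbee", "relation", "The dog is about to catch the frisbee."))
    g.add_edge("image", "dog_frisbee", "dog and frisbee")
    g.add_edge("dog_frisbee", "dog", "dog")
    g.add_edge("dog_frisbee", "frisbee", "frisbee")
    return g
```

In this sketch, the node captions stay free-form text while the hierarchy lives entirely in the edges, which mirrors the division of roles the abstract describes. Calling build_example_graph().children("dog_frisbee") returns the two entity nodes that the relation node connects.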
