GPT4SGG: Synthesizing Scene Graphs from Holistic and Region-specific Narratives

Zuyao Chen, Jinlin Wu, Zhen Lei, Zhaoxiang Zhang, Changwen Chen

TL;DR

This work tackles the challenge of learning scene graphs from natural language captions by addressing grounding ambiguity and long-tail bias through a divide-and-conquer framework, GPT4SGG. It grounds objects first, then decomposes each image into holistic and region-specific narratives, and uses an LLM to reason about relationships to synthesize comprehensive scene graphs. The approach is validated with two instruction-following datasets, COCO@GPT and VG@GPT, and a private Llama 2 model tuned via LoRA, showing notable improvements on VG150 and meaningful gains over baselines on long-tail metrics. The findings highlight the potential of LLM-driven relation reasoning on textual image representations to produce high-quality SGG supervision from captions, with practical implications for cost-effective open-world scene understanding.

Abstract

Training Scene Graph Generation (SGG) models with natural language captions has become increasingly popular because captions offer abundant, cost-effective supervision that generalizes to open-world settings. However, such unstructured caption data and its processing pose significant challenges in learning accurate and comprehensive scene graphs. The challenges fall into three aspects: 1) traditional scene graph parsers based on linguistic representation often fail to extract meaningful relationship triplets from caption data; 2) grounding the unlocalized objects of parsed triplets runs into ambiguity in visual-language alignment; 3) caption data are typically sparse and biased toward partial observations of image content. To address these problems, we propose a divide-and-conquer strategy with a novel framework named GPT4SGG, to obtain more accurate and comprehensive scene graph signals. This framework decomposes a complex scene into a set of simple regions, resulting in a set of region-specific narratives. With these region-specific narratives (partial observations) and a holistic narrative (global observation) for an image, a large language model (LLM) performs relationship reasoning to synthesize an accurate and comprehensive scene graph. Experimental results demonstrate that GPT4SGG significantly improves the performance of SGG models trained on image-caption data, handling the ambiguity issue and long-tail bias well by providing more accurate and comprehensive scene graphs.
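The idea of handing an LLM a textual image representation (localized objects, one holistic narrative, and several region-specific narratives) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the class names `GroundedObject` and `RegionNarrative`, the function `build_prompt`, the prompt wording, and the example captions are hypothetical and are not taken from the paper's actual prompt or code.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GroundedObject:
    # Hypothetical container: a unique object id plus its box (x1, y1, x2, y2).
    name: str
    box: Tuple[int, int, int, int]


@dataclass
class RegionNarrative:
    # Hypothetical container: a caption describing a region covering some objects.
    object_ids: List[str]
    caption: str


def build_prompt(objects: List[GroundedObject],
                 global_caption: str,
                 regions: List[RegionNarrative]) -> str:
    """Assemble a plain-text description of an image for an LLM.

    The layout and instruction wording are illustrative assumptions only.
    """
    lines = ["Objects:"]
    for obj in objects:
        lines.append(f"- {obj.name}: {list(obj.box)}")
    lines.append(f"Global description: {global_caption}")
    lines.append("Region descriptions:")
    for region in regions:
        lines.append(f"- [{', '.join(region.object_ids)}]: {region.caption}")
    lines.append(
        "List relationships as (subject, predicate, object) triplets "
        "using only the object ids above."
    )
    return "\n".join(lines)


# Made-up example data, not drawn from COCO or Visual Genome.
objects = [
    GroundedObject("person.1", (10, 20, 110, 220)),
    GroundedObject("horse.2", (40, 90, 260, 300)),
]
regions = [RegionNarrative(["person.1", "horse.2"], "a person riding a horse")]
prompt = build_prompt(objects, "a person rides a horse in a field", regions)
```

Because every object carries a unique id and a box before the LLM is asked about relationships, any triplet the model returns can be mapped back to localized objects without a separate grounding step, which is the ambiguity the abstract describes.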

Paper Structure (25 sections, 11 figures, 11 tables, 1 algorithm)

Figures (11)

  • Figure 1: Challenges in learning scene graphs from natural language description.
  • Figure 2: Our framework vs. previous pipeline of language-supervised SGG.
  • Figure 3: An overview of our approach GPT4SGG. GPT4SGG decomposes a complex scene into a set of region-specific narratives and a global narrative, and utilizes an LLM to deduce relationships based on the localized objects and narratives.
  • Figure 4: Comparison with manual annotation on VG150 validation set. (a) quantitative results of using different semantic matching strategies; (b) an example of low recall rate based on manual annotation.
  • Figure 5: Recall rate under different settings.
  • ...and 6 more figures