From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models

Rongjie Li, Songyang Zhang, Dahua Lin, Kai Chen, Xuming He

TL;DR

This work tackles open-vocabulary scene graph generation (SGG) by recasting it as an image-to-sequence task for a generative vision-language model (VLM), so that novel predicate concepts can be produced through scene graph prompts and a relation-aware grounding pipeline. The framework, PGSG, comprises scene graph sequence generation, an Entity Grounding Module, and a Category Conversion Module. Its learning objective unifies token-level sequence modeling with spatial grounding, and its inference strategy yields diverse yet precise relation triplets. The work also shows that explicit scene-graph knowledge transfers to downstream vision-language (VL) tasks via fine-tuning, improving visual grounding, VQA, and image captioning. On PSG, VG, and OpenImages, PGSG achieves state-of-the-art open-vocabulary SGG results and consistent gains on VL tasks. The authors identify improving closed-vocabulary performance and extending the approach to additional VL models as future work.

Abstract

Scene graph generation (SGG) aims to parse a visual scene into an intermediate graph representation for downstream reasoning tasks. Despite recent advancements, existing methods struggle to generate scene graphs with novel visual relation concepts. To address this challenge, we introduce a new open-vocabulary SGG framework based on sequence generation. Our framework leverages vision-language pre-trained models (VLM) by incorporating an image-to-graph generation paradigm. Specifically, we generate scene graph sequences via image-to-text generation with VLM and then construct scene graphs from these sequences. By doing so, we harness the strong capabilities of VLM for open-vocabulary SGG and seamlessly integrate explicit relational modeling for enhancing the VL tasks. Experimental results demonstrate that our design not only achieves superior performance with an open vocabulary but also enhances downstream vision-language task performance through explicit relation modeling knowledge.
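
The step of constructing a scene graph from a generated sequence can be illustrated with a minimal Python sketch. The sequence layout used here (`subject [ENT] predicate [REL] object [ENT]` segments joined by `[SEP]`) and the names `Triplet`, `parse_scene_graph_sequence`, and `build_graph` are assumptions made for illustration; the summary above does not specify the exact token format, and the sketch omits entity grounding entirely.

```python
import re
from typing import Dict, List, NamedTuple


class Triplet(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Assumed (hypothetical) layout of one relation segment:
#   "<subject> [ENT] <predicate> [REL] <object> [ENT]"
TRIPLET_RE = re.compile(
    r"^\s*(.+?)\s*\[ENT\]\s*(.+?)\s*\[REL\]\s*(.+?)\s*\[ENT\]\s*$"
)


def parse_scene_graph_sequence(text: str) -> List[Triplet]:
    """Split generated text on [SEP] and keep only well-formed triplets."""
    triplets = []
    for segment in text.split("[SEP]"):
        match = TRIPLET_RE.match(segment)
        if match is None:
            continue  # skip segments that do not follow the assumed layout
        triplets.append(Triplet(*match.groups()))
    return triplets


def build_graph(triplets: List[Triplet]) -> Dict[str, list]:
    """Collect entity names as nodes and triplets as labeled directed edges."""
    nodes = sorted({t.subject for t in triplets} | {t.obj for t in triplets})
    edges = [(t.subject, t.predicate, t.obj) for t in triplets]
    return {"nodes": nodes, "edges": edges}


if __name__ == "__main__":
    generated = (
        "man [ENT] riding [REL] horse [ENT] [SEP] "
        "horse [ENT] standing on [REL] grass [ENT]"
    )
    print(build_graph(parse_scene_graph_sequence(generated)))
```

Because predicates are read as free text rather than chosen from a fixed label set, a parser of this kind places no restriction on the predicate vocabulary, which is the property the sequence-generation formulation relies on.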

Paper Structure

This paper contains 33 sections, 4 equations, 4 figures, 11 tables.

Figures (4)

  • Figure 1: Comparison of open-vocabulary SGG paradigms. (A) Previous work adopts task-specific VLMs as predicate classifiers over given entity proposals. (B) Our framework offers a unified approach that generates scene graphs with novel predicates directly from images and also performs VL tasks.
  • Figure 2: Illustration of the overall pipeline of PGSG. We generate scene graph sequences from images using a VLM. The relation construction module then grounds the entities and converts the sequence's vocabulary predictions into categorical labels. For VL tasks, SGG training provides the parameter initialization for fine-tuning the VLM.
  • Figure 3: Illustration of the entity grounding module and the category conversion module. (Left) The entity grounding module localizes the entities in scene graph sequences by predicting their bounding boxes. (Right) The category conversion module maps the vocabulary sequence predictions into categorical predictions.
  • Figure 4: Visualization of scene graph sequence predictions from PGSG.
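
The idea behind the category conversion module in Figure 3, mapping free-form vocabulary predictions onto a dataset's category set, can be sketched in Python as follows. This is only a stand-in: it matches strings with the standard library's `difflib`, whereas the captions above do not say how PGSG scores candidates, and the function name `convert_to_category` and the `cutoff` default are hypothetical.

```python
from difflib import get_close_matches
from typing import Optional, Sequence


def convert_to_category(
    predicted: str, categories: Sequence[str], cutoff: float = 0.6
) -> Optional[str]:
    """Map a free-form predicted label to the closest known category.

    String similarity is an illustrative substitute here, not the
    module's actual scoring rule.
    """
    normalized = predicted.strip().lower()
    lookup = {c.lower(): c for c in categories}
    if normalized in lookup:
        return lookup[normalized]  # exact match, ignoring case
    matches = get_close_matches(normalized, list(lookup), n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else None


if __name__ == "__main__":
    predicates = ["riding", "standing on", "holding"]
    print(convert_to_category("Riding", predicates))
    print(convert_to_category("ridding", predicates))
    print(convert_to_category("zzzz", predicates))
```

Returning `None` for labels with no close match leaves the caller free to keep such predictions as open-vocabulary outputs or discard them when evaluating against a fixed label set.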