LLaVA-SpaceSGG: Visual Instruct Tuning for Open-vocabulary Scene Graph Generation with Enhanced Spatial Relations

Mingjie Xu, Mengyang Wu, Yuzhi Zhao, Jason Chun Lok Li, Weifeng Ou

TL;DR

This work tackles open-vocabulary scene graph generation (SGG) and the underexplored area of 3D spatial relations by introducing LLaVA-SpaceSGG, a multimodal large language model (MLLM) trained on SpaceSGG, a new instruction-tuning dataset. SpaceSGG combines 2D scene graphs with depth-based 3D cues to produce three kinds of training data: spatial descriptions, QA pairs, and multi-turn conversations. It is paired with a two-stage instruction-tuning regime that transfers the MLLM's inherent capabilities to the SGG task. On the Panoptic Scene Graph (PSG) dataset, the approach outperforms other open-vocabulary SGG methods, boosting Recall by $8.6\%$ and mRecall by $28.4\%$ over the baseline, and it shows superior spatial understanding on a dedicated spatial relation validation set ($3.8\%$ accuracy). These contributions enable more robust open-vocabulary SGG and sharper spatial reasoning for downstream tasks that require nuanced scene understanding. Code, data, and models are publicly available.

Abstract

Scene Graph Generation (SGG) converts visual scenes into structured graph representations, providing deeper scene understanding for complex vision tasks. However, existing SGG models often overlook essential spatial relationships and struggle with generalization in open-vocabulary contexts. To address these limitations, we propose LLaVA-SpaceSGG, a multimodal large language model (MLLM) designed for open-vocabulary SGG with enhanced spatial relation modeling. To train it, we collect the SGG instruction-tuning dataset, named SpaceSGG. This dataset is constructed by combining publicly available datasets and synthesizing data using open-source models within our data construction pipeline. It combines object locations, object relations, and depth information, resulting in three data formats: spatial SGG description, question-answering, and conversation. To enhance the transfer of MLLMs' inherent capabilities to the SGG task, we introduce a two-stage training paradigm. Experiments show that LLaVA-SpaceSGG outperforms other open-vocabulary SGG methods, boosting recall by 8.6% and mean recall by 28.4% compared to the baseline. Our codebase, dataset, and trained models are publicly accessible on GitHub at the following URL: https://github.com/Endlinc/LLaVA-SpaceSGG.
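The abstract reports gains in both recall and mean recall. As a rough illustration of how the two metrics differ, the sketch below scores predicted (subject, predicate, object) triplets for a single image. It is a simplified assumption, not the paper's evaluation code: it matches triplets by exact label equality, whereas benchmark protocols typically also match predicted regions to ground truth and consider only the top-K predictions. The function name and the example triplets are hypothetical.

```python
from collections import defaultdict


def recall_and_mean_recall(gt_triplets, pred_triplets):
    """Score predicted scene-graph triplets against ground truth.

    Both arguments are iterables of (subject, predicate, object) tuples.
    Simplification (assumption): a prediction counts as a hit only when
    all three labels match exactly.
    """
    predicted = set(pred_triplets)
    hits = defaultdict(int)    # correct triplets per predicate
    totals = defaultdict(int)  # ground-truth triplets per predicate

    for subj, pred, obj in gt_triplets:
        totals[pred] += 1
        if (subj, pred, obj) in predicted:
            hits[pred] += 1

    n_gt = sum(totals.values())
    # Recall: fraction of all ground-truth triplets that were recovered.
    recall = sum(hits.values()) / n_gt if n_gt else 0.0
    # Mean recall: average of per-predicate recalls, so rare predicates
    # weigh as much as frequent ones.
    per_predicate = [hits[p] / totals[p] for p in totals]
    mean_recall = sum(per_predicate) / len(per_predicate) if per_predicate else 0.0
    return recall, mean_recall


# Hypothetical example: the frequent predicate "on" is fully recovered,
# the rare predicate "behind" is missed.
gt = [
    ("person", "on", "chair"),
    ("cup", "on", "table"),
    ("lamp", "on", "desk"),
    ("dog", "behind", "sofa"),
]
pred = [
    ("person", "on", "chair"),
    ("cup", "on", "table"),
    ("lamp", "on", "desk"),
]
recall, mean_recall = recall_and_mean_recall(gt, pred)
print(recall, mean_recall)  # 0.75 0.5
```

Because mean recall averages over predicate classes, a model that improves on infrequent relations (spatial ones such as "behind" may be among them) can raise mean recall far more than overall recall.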

Paper Structure

This paper contains 25 sections, 2 equations, 12 figures, 6 tables, 2 algorithms.

Figures (12)

  • Figure 1: The illustration of different tasks: (a) Object Detection, (b) Scene Graph Generation (SGG), and (c) Scene Graph Generation (SGG) with enhanced spatial relations. By additionally leveraging spatial relationships, we propose the LLaVA-SpaceSGG framework.
  • Figure 2: SpaceSGG dataset construction pipeline. We utilize both SGG description and spatial relationships, where we generate 3 types of data: spatial scene detailed descriptions (SpaceSGG-Desc), QA (SpaceSGG-QA), and multi-turn conversations (SpaceSGG-Conv).
  • Figure 3: 3D Information Extraction: We retrieve the spatial layering distribution of the input images with the assistance of object detectors and a depth estimator.
  • Figure 4: An example of SpaceSGG-Desc, SpaceSGG-QA, and SpaceSGG-Conv generation process.
  • Figure 5: Our proposed training paradigm and used training dataset.
  • ...and 7 more figures
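
The caption of Figure 3 describes combining object detections with a depth estimate to recover how objects are layered in space. The sketch below shows one way such a step could work; it is an illustration under stated assumptions, not the authors' pipeline. The assumptions are that the depth map is a 2D grid where smaller values mean closer to the camera, that each object is an axis-aligned box `(x0, y0, x1, y1)` with exclusive upper bounds, and that a fixed `margin` separates distinct depth layers. All function names, the margin, and the toy data are hypothetical.

```python
from statistics import median


def box_depth(depth_map, box):
    """Median depth inside a box (x0, y0, x1, y1), upper bounds exclusive."""
    x0, y0, x1, y1 = box
    values = [depth_map[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    return median(values)


def depth_relations(depth_map, objects, margin=0.5):
    """Derive 'in front of' relations from per-object median depth.

    Assumption: smaller depth values are closer to the camera. Two objects
    are related only if their depths differ by more than `margin`.
    """
    depths = {name: box_depth(depth_map, box) for name, box in objects.items()}
    ordered = sorted(depths, key=depths.get)  # nearest object first
    relations = []
    for i, near in enumerate(ordered):
        for far in ordered[i + 1:]:
            if depths[far] - depths[near] > margin:
                relations.append((near, "in front of", far))
    return depths, relations


# Toy 4x4 depth map standing in for the output of a depth estimator.
depth_map = [
    [1, 1, 5, 5],
    [1, 1, 5, 5],
    [9, 9, 5, 5],
    [9, 9, 5, 5],
]
# Toy boxes standing in for the output of an object detector.
objects = {
    "cup": (0, 0, 2, 2),
    "table": (2, 0, 4, 4),
    "wall": (0, 2, 2, 4),
}
depths, relations = depth_relations(depth_map, objects)
for triplet in relations:
    print(triplet)
```

Using the median rather than the mean keeps a box's depth stable when it contains some background pixels, which is common for detections of thin or irregular objects.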