LLM Meets Scene Graph: Can Large Language Models Understand and Generate Scene Graphs? A Benchmark and Empirical Study

Dongil Yang, Minjin Kim, Sunghwan Kim, Beong-woo Kwak, Minjun Park, Jinseok Hong, Woontack Woo, Jinyoung Yeo

TL;DR

The paper presents Text-Scene Graph (TSG) Bench, a benchmark for evaluating LLMs on both understanding scene graphs and generating them from textual narratives. Evaluating 11 LLMs on a large, human-curated dataset built from ego-centric action graphs, the study finds that LLMs perform well on the understanding tasks, scene graph question answering (SGQA) and scene graph description selection (SGDS), but struggle with the generation tasks, single-action and multiple-action scene graph generation (SA-SGG and MA-SGG), with multiple-action decomposition posing the largest challenge. It further analyzes prompting strategies (chain-of-thought and in-context learning), error refinement, and hallucination tendencies to identify ways of improving performance. The work releases an open dataset, a demo, and code to support future research in multimodal grounding and structured reasoning. Overall, the findings highlight a clear gap between understanding and generating structured scene representations and point to concrete avenues for methodological improvement and multimodal integration.

Abstract

The remarkable reasoning and generalization capabilities of Large Language Models (LLMs) have paved the way for their expanding applications in embodied AI, robotics, and other real-world tasks. To effectively support these applications, grounding in spatial and temporal understanding in multimodal environments is essential. To this end, recent works have leveraged scene graphs, a structured representation that encodes entities, attributes, and their relationships in a scene. However, a comprehensive evaluation of LLMs' ability to utilize scene graphs remains limited. In this work, we introduce Text-Scene Graph (TSG) Bench, a benchmark designed to systematically assess LLMs' ability to (1) understand scene graphs and (2) generate them from textual narratives. With TSG Bench we evaluate 11 LLMs and reveal that, while models perform well on scene graph understanding, they struggle with scene graph generation, particularly for complex narratives. Our analysis indicates that these models fail to effectively decompose discrete scenes from a complex narrative, leading to a bottleneck when generating scene graphs. These findings underscore the need for improved methodologies in scene graph generation and provide valuable insights for future research. The demonstration of our benchmark is available at https://tsg-bench.netlify.app. Additionally, our code and evaluation data are publicly available at https://github.com/docworlds/tsg-bench.
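To make the notion of a scene graph concrete, the following is a minimal Python sketch of one possible in-memory representation: a set of nodes plus a set of directed, labeled edges. The class names (`Edge`, `SceneGraph`), the method names, and the example triples are illustrative assumptions and are not taken from the TSG Bench code or data format.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass(frozen=True)
class Edge:
    """A directed, labeled relationship between two entities (hypothetical schema)."""
    source: str
    relation: str
    target: str


@dataclass
class SceneGraph:
    """A scene as a set of entity nodes and a set of relationship edges."""
    nodes: Set[str] = field(default_factory=set)
    edges: Set[Edge] = field(default_factory=set)

    def add(self, source: str, relation: str, target: str) -> None:
        # Adding an edge also registers both endpoints as nodes.
        self.nodes.update((source, target))
        self.edges.add(Edge(source, relation, target))

    def targets(self, source: str, relation: str) -> List[str]:
        # Answer a simple lookup: which entities does `source` relate to via `relation`?
        return sorted(
            e.target for e in self.edges
            if e.source == source and e.relation == relation
        )


# Illustrative scene: a person cutting a carrot on a cutting board.
graph = SceneGraph()
graph.add("person", "holds", "knife")
graph.add("knife", "cuts", "carrot")
graph.add("carrot", "on", "cutting board")

print(graph.targets("person", "holds"))
```

An understanding task then amounts to reasoning over such a structure, while a generation task amounts to producing it from a textual description.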

Paper Structure

This paper contains 46 sections, 22 figures, and 9 tables.

Figures (22)

  • Figure 1: An illustration showing how a scene graph can represent the objects and their relationships in a scene. The illustration was created with the assistance of DALLE-3.
  • Figure 2: Overview of TSG Bench. Scene graph question answering aims to answer a question by reasoning over scene graphs. Scene graph description selection is a multiple-choice task of selecting the correct description of a given scene graph. Single action scene graph generation focuses on generating a scene graph based on a description representing a single action. Multiple action scene graph generation aims to generate multiple discrete scene graphs of all actions represented in the description.
  • Figure 3: Domain distribution across scenarios.
  • Figure 4: F1-score results on decomposed scene graph generation tasks, distinguishing between single-action and multiple-action settings. ES (Edge Single) and NS (Node Single) evaluate edge and node generation performance in SA-SGG, respectively. EM (Edge Multiple) and NM (Node Multiple) assess edge and node generation in MA-SGG. ACT (Action) measures the model’s performance in action decomposition in MA-SGG.
  • Figure 5: The results of the comparative evaluation under four conditions: Explicit, Implicit, Repetition, and All. The “All” condition comprises the entire dataset, while the other three focus on subsets featuring explicit actions, implicit actions, or repeated actions.
  • ...and 17 more figures
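
The node-level and edge-level F1 scores mentioned in the Figure 4 caption can be illustrated with a small sketch. It assumes exact-match comparison between predicted and gold sets, which may differ from the matching protocol the paper actually uses; the function name `f1_score` and the example data are hypothetical.

```python
def f1_score(predicted, gold):
    """Set-based F1 under an exact-match assumption (not necessarily the paper's protocol)."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0  # Both empty: treat as a perfect match.
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Illustrative gold and predicted graphs for a single action.
gold_edges = {("person", "holds", "knife"), ("knife", "cuts", "carrot")}
pred_edges = {("person", "holds", "knife"), ("knife", "cuts", "onion")}

gold_nodes = {"person", "knife", "carrot"}
pred_nodes = {"person", "knife", "onion"}

edge_f1 = f1_score(pred_edges, gold_edges)  # one of two edges matches
node_f1 = f1_score(pred_nodes, gold_nodes)  # two of three nodes match

print(f"edge F1 = {edge_f1:.3f}, node F1 = {node_f1:.3f}")
```

Scoring nodes and edges separately, as in this sketch, shows whether an error comes from a missing entity or from a wrong relationship between correctly identified entities.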