Language-Driven Object-Oriented Two-Stage Method for Scene Graph Anticipation

Xiaomeng Zhu, Changwei Wang, Haozhe Wang, Xinyu Liu, Fangzhen Lin

TL;DR

This work tackles scene graph anticipation by decoupling visual perception from semantic reasoning and leveraging linguistic priors. It introduces Linguistic Scene Graph Anticipation (LSGA) and the Object-Oriented Two-Stage Method (OOTSM), which first forecasts object sets and then refines object-centric relations, supported by a temporal transition regularizer. On a dedicated benchmark built from Action Genome, compact fine-tuned LLMs outperform strong API baselines on text-based SGA and, when combined with visual scene-graph detectors, deliver substantial gains in long-horizon video SGA. The framework remains robust under input noise and across different detectors, which supports the practical value of linguistic reasoning in dynamic scene understanding for surveillance and human-machine collaboration.

Abstract

A scene graph is a structured representation of objects and their spatio-temporal relationships in dynamic scenes. Scene Graph Anticipation (SGA) involves predicting future scene graphs from video clips, enabling applications in intelligent surveillance and human-machine collaboration. While recent SGA approaches excel at leveraging visual evidence, long-horizon forecasting fundamentally depends on semantic priors and commonsense temporal regularities that are challenging to extract purely from visual features. To explicitly model these semantic dynamics, we propose Linguistic Scene Graph Anticipation (LSGA), a linguistic formulation of SGA that performs temporal relational reasoning over sequences of textualized scene graphs, with visual scene-graph detection handled by a modular front-end when operating on video. Building on this formulation, we introduce the Object-Oriented Two-Stage Method (OOTSM), a language-based framework that anticipates object-set dynamics and forecasts object-centric relation trajectories with temporal consistency regularization, and we evaluate it on a dedicated benchmark constructed from Action Genome annotations. Extensive experiments show that compact fine-tuned language models with up to 3B parameters consistently outperform strong zero- and one-shot API baselines, including GPT-4o, GPT-4o-mini, and DeepSeek-V3, under matched textual inputs and context windows. When coupled with off-the-shelf visual scene-graph generators, the resulting multimodal system achieves substantial improvements on video-based SGA, boosting long-horizon mR@50 by up to 21.9% over strong visual SGA baselines.

Paper Structure

This paper contains 23 sections, 8 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Overall OOTSM pipeline. The left branch converts observed video frames into compact textual scene graphs; the GOA module performs dynamic object anticipation, whereas the OORA module carries out object-oriented relationship anticipation. Scene graphs can also bypass GOA and feed directly into OORA (dotted arrow), provided the continuous-object constraint holds: only objects present in the final observed frame are projected forward. The dashed gray path is the optional integration of a visual SGG tool.
  • Figure 2: GOA training flow. Observed frames with identical scene graph structures are first merged and converted to textual descriptions, then combined with instructions to construct prompts. A fine-tuned LLM then predicts future object sets, supervised by the unobserved scene graph targets via a temporally weighted cross-entropy loss $L_{\text{GOA}}$.
  • Figure 3: OORA training flow. Observed scene graphs are first grouped by object, then frames with identical scene graph structures are merged. Scene graph descriptions are combined with object-specific information to build prompts. These object-specific prompts guide a fine-tuned LLM to predict future relationships ($L_{\text{CE}}$). An auxiliary classifier generates per-frame relationship probabilities ($L_{\text{BCE}}$), while a transition regularizer ($L_{\text{trans}}$) enforces temporal coherence by penalizing improbable state transitions.
  • Figure 4: Qualitative results of OOTSM. Two representative videos from Action Genome illustrate objects that newly appear or disappear in future predictions. Each example shows observed and unobserved frames with GOA-predicted future objects and OORA-anticipated relations. Green entries denote correct predictions; red entries denote incorrect predictions.
  • Figure 5: Observation window length and weighting hyper-parameter. Left: effect of varying the number of observed frames on recall performance. Right: study of the cosine-weighting coefficient $\beta$. Shaded region denotes one standard deviation over three independent runs per value.
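
The preprocessing described in the captions of Figures 1 to 3 (merging frames that share an identical scene graph, converting the result to text, and restricting forecasts to objects present in the final observed frame) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the triplet layout, the function names `merge_identical_frames`, `textualize`, and `continuing_objects`, and the output text format are all hypothetical choices made for illustration.

```python
from typing import FrozenSet, List, Tuple

# Assumed representation: a frame's scene graph is a set of
# (subject, predicate, object) triplets. The paper does not specify this layout.
Triplet = Tuple[str, str, str]
Frame = FrozenSet[Triplet]
Segment = Tuple[int, int, Frame]  # (first frame index, last frame index, graph)


def merge_identical_frames(frames: List[Frame]) -> List[Segment]:
    """Collapse runs of consecutive frames whose scene graphs are identical."""
    segments: List[Segment] = []
    for idx, graph in enumerate(frames):
        if segments and segments[-1][2] == graph:
            start, _, _ = segments[-1]
            segments[-1] = (start, idx, graph)  # extend the current run
        else:
            segments.append((idx, idx, graph))  # start a new run
    return segments


def textualize(segments: List[Segment]) -> str:
    """Render merged segments as one line of text each (hypothetical format)."""
    lines = []
    for start, end, graph in segments:
        span = f"Frame {start}" if start == end else f"Frames {start}-{end}"
        facts = "; ".join(f"{s} {p} {o}" for s, p, o in sorted(graph))
        lines.append(f"{span}: {facts or 'no relations'}")
    return "\n".join(lines)


def continuing_objects(frames: List[Frame]) -> List[str]:
    """Continuous-object constraint: keep only objects in the final observed frame."""
    if not frames:
        return []
    return sorted({obj for _, _, obj in frames[-1]})


if __name__ == "__main__":
    observed = [
        frozenset({("person", "holding", "cup")}),
        frozenset({("person", "holding", "cup")}),
        frozenset({("person", "holding", "cup"), ("person", "looking_at", "laptop")}),
    ]
    print(textualize(merge_identical_frames(observed)))
    print(continuing_objects(observed))
```

Merging before textualization shortens the prompt without discarding information, since a run of identical frames is fully described by its start index, end index, and one copy of the graph.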