Language-Driven Object-Oriented Two-Stage Method for Scene Graph Anticipation
Xiaomeng Zhu, Changwei Wang, Haozhe Wang, Xinyu Liu, Fangzhen Lin
TL;DR
This work tackles scene graph anticipation by decoupling visual perception from semantic reasoning and leveraging linguistic priors. It introduces Linguistic Scene Graph Anticipation (LSGA) and the Object-Oriented Two-Stage Method (OOTSM), which first forecasts object sets and then refines object-centric relations, supported by a temporal transition regularizer. On a dedicated benchmark built from Action Genome annotations, extensive experiments show that compact fine-tuned LLMs outperform strong API baselines on text-based SGA and, when combined with visual scene-graph detectors, deliver substantial gains in long-horizon video SGA. The framework remains robust to noise and detector variations, highlighting the practical value of integrating linguistic reasoning into dynamic scene understanding for surveillance and human-machine collaboration.
Abstract
A scene graph is a structured representation of objects and their spatio-temporal relationships in dynamic scenes. Scene Graph Anticipation (SGA) involves predicting future scene graphs from video clips, enabling applications in intelligent surveillance and human-machine collaboration. While recent SGA approaches excel at leveraging visual evidence, long-horizon forecasting fundamentally depends on semantic priors and commonsense temporal regularities that are challenging to extract purely from visual features. To explicitly model these semantic dynamics, we propose Linguistic Scene Graph Anticipation (LSGA), a linguistic formulation of SGA that performs temporal relational reasoning over sequences of textualized scene graphs, with visual scene-graph detection handled by a modular front-end when operating on video. Building on this formulation, we introduce the Object-Oriented Two-Stage Method (OOTSM), a language-based framework that anticipates object-set dynamics and forecasts object-centric relation trajectories with temporal consistency regularization, and we evaluate it on a dedicated benchmark constructed from Action Genome annotations. Extensive experiments show that compact fine-tuned language models with up to 3B parameters consistently outperform strong zero- and one-shot API baselines, including GPT-4o, GPT-4o-mini, and DeepSeek-V3, under matched textual inputs and context windows. When coupled with off-the-shelf visual scene-graph generators, the resulting multimodal system achieves substantial improvements on video-based SGA, boosting long-horizon mR@50 by up to 21.9% over strong visual SGA baselines.
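The mR@50 figure above refers to mean recall at K, which averages recall over predicate classes so that rare relations count as much as frequent ones. The following is a minimal sketch of that idea for a single frame, assuming exact triplet matching; the function name `mean_recall_at_k` and the example triplets are hypothetical, and this is not the evaluation code used in the paper, which may aggregate over frames and apply graph constraints differently.

```python
from collections import defaultdict


def mean_recall_at_k(predictions, ground_truth, k=50):
    """Illustrative mean recall@K for one frame (assumed, simplified protocol).

    predictions: list of (score, (subject, predicate, object)) tuples.
    ground_truth: set of (subject, predicate, object) triplets.
    """
    # Keep the k highest-scoring predicted triplets.
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    top_k = {triplet for _, triplet in ranked[:k]}

    hits = defaultdict(int)    # matched ground-truth triplets per predicate
    totals = defaultdict(int)  # all ground-truth triplets per predicate
    for triplet in ground_truth:
        predicate = triplet[1]
        totals[predicate] += 1
        if triplet in top_k:
            hits[predicate] += 1

    if not totals:
        return 0.0
    # Average the per-predicate recalls instead of pooling all triplets.
    return sum(hits[p] / totals[p] for p in totals) / len(totals)


# Hypothetical ground truth and scored predictions for one future frame.
gt = {
    ("person", "holding", "cup"),
    ("person", "holding", "phone"),
    ("person", "looking_at", "cup"),
    ("person", "sitting_on", "chair"),
}
preds = [
    (0.9, ("person", "holding", "cup")),
    (0.8, ("person", "sitting_on", "chair")),
    (0.4, ("person", "holding", "phone")),
    (0.3, ("person", "touching", "table")),
]
print(mean_recall_at_k(preds, gt, k=3))
```

With k=3, three of the four ground-truth triplets are recovered, so pooled recall would be 0.75, while the per-predicate average is lower because the missed `looking_at` class contributes a recall of zero.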
