Towards Spatio-Temporal World Scene Graph Generation from Monocular Videos

Rohith Peddi, Saurabh, Shravan Shanmugam, Likhitha Pallapothula, Yu Xiang, Parag Singla, Vibhav Gogate

Abstract

Spatio-temporal scene graphs provide a principled representation for modeling evolving object interactions, yet existing methods remain fundamentally frame-centric: they reason only about currently visible objects, discard entities upon occlusion, and operate in 2D. To address this, we first introduce ActionGenome4D, a dataset that upgrades Action Genome videos into 4D scenes via feed-forward 3D reconstruction, world-frame oriented bounding boxes for every object involved in actions, and dense relationship annotations, including for objects that are temporarily unobserved due to occlusion or camera motion. Building on this data, we formalize World Scene Graph Generation (WSGG), the task of constructing a world scene graph at each timestamp that encompasses all interacting objects in the scene, both observed and unobserved. We then propose three complementary methods, each exploring a different inductive bias for reasoning about unobserved objects: PWG (Persistent World Graph), which implements object permanence via a zero-order feature buffer; MWAE (Masked World Auto-Encoder), which reframes unobserved-object reasoning as masked completion with cross-view associative retrieval; and 4DST (4D Scene Transformer), which replaces the static buffer with differentiable per-object temporal attention enriched by 3D motion and camera-pose features. We further design a suite of Graph RAG-based approaches and use them to evaluate strong open-source Vision-Language Models on the WSGG task, establishing baselines for unlocalized relationship prediction. WSGG thus advances video scene understanding toward world-centric, temporally persistent, and interpretable scene reasoning.
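
To make the "zero-order feature buffer" behind PWG concrete, the following is a minimal sketch of a last-known-state buffer: each object keeps the most recent feature at which it was observed, and that feature is held unchanged while the object is out of view. The class name, method names, and the plain-list feature format are hypothetical and chosen only for illustration; the paper's actual implementation may differ.

```python
from dataclasses import dataclass, field


@dataclass
class LastKnownStateBuffer:
    """Zero-order hold over per-object features (illustrative sketch, not the paper's code)."""

    features: dict = field(default_factory=dict)   # object id -> last observed feature
    last_seen: dict = field(default_factory=dict)  # object id -> timestamp of last observation

    def update(self, t, observed):
        """Overwrite the stored state of every object that is visible at time t."""
        for obj_id, feat in observed.items():
            self.features[obj_id] = list(feat)
            self.last_seen[obj_id] = t

    def read(self, t):
        """Return (feature, staleness, is_observed) for every object seen so far."""
        return {
            obj_id: (feat, t - self.last_seen[obj_id], self.last_seen[obj_id] == t)
            for obj_id, feat in self.features.items()
        }


buf = LastKnownStateBuffer()
buf.update(0, {"bed": [1.0, 0.0], "laptop": [0.0, 1.0]})
buf.update(1, {"laptop": [0.5, 0.5]})  # the bed has left the camera view at t = 1
state = buf.read(1)
```

Here `state["bed"]` still carries the feature from `t = 0` together with a staleness of one step, while `state["laptop"]` is fresh. A downstream relationship predictor could condition on the staleness value to discount old evidence; that design choice is an assumption of this sketch.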

Paper Structure (168 sections, 73 equations, 10 figures, 16 tables, 2 algorithms)

Figures (10)

  • Figure 1: World Scene Graph Generation (WSGG). Unlike standard Video Scene Graph Generation (left), which is constrained to the instantaneous camera view and discards objects once they exit the frame or become occluded, our proposed task (right) grounds scene understanding in a global 3D world coordinate frame. Models trained on WSGG output a comprehensive world scene graph at each timestamp containing all the objects in the environment. As shown at $t=30$s, out-of-view objects (e.g., bed, laptop) remain perfectly localized in 3D, enabling global, view-independent, interpretable scene reasoning. Blue curves drawn from person to objects (right) represent relationships.
  • Figure 2: WorldSGG Dataset & 3D OBB construction pipeline. Our offline framework generates persistent 3D annotations through three sequential steps: (a) Scene Construction recovers the global scene point cloud and camera poses using $\pi^3$ and bundle adjustment; (b) Floor Determination establishes a canonical ground plane by aligning 3D human meshes extracted via PromptHMR; and (c) 3D OBB Construction refines raw object geometries to fit robust 3D bounding boxes in the world frame.
  • Figure 3: WorldSGG Methods. Multi-modal inputs including DINO visual features, monocular 3D geometries, and camera extrinsics are processed into unified structural, motion, and ego-pose tokens. (Right & Bottom) Unobserved Object Processing: To maintain strict object permanence for entities outside the current camera frustum, we propose and evaluate three structural variants: (1) PWG (Persistent World Graph) utilizes a Last-Known-State (LKS) buffer to explicitly track and propagate historical features based on temporal staleness ($\Delta_n$). (2) MWAE (Masked World Auto-Encoder) frames natural occlusion as a masked modeling task; it masks unobserved visual streams and uses an Associative Retriever with asymmetric cross-attention (querying all tokens against only visible ones) to reconstruct missing features of unobserved objects. (3) 4DST (4D Scene Transformer) fuses multi-modal tokens at a Fusion Node and applies unmasked bidirectional temporal self-attention followed by Spatial GNNs to output dense, globally-aware spatiotemporal representations ($\mathbf{H}^{(t)}$).
  • Figure 4: The Graph RAG inference pipeline. (Left) Sequential video segments are processed by a VLM into local captions to construct a global Coarse Event Graph, establishing a high-level temporal narrative. (Right) To generate the fine-grained world scene graph, our Graph RAG module retrieves overarching spatiotemporal context from the coarse graph. This global prior is injected into both the visual stream (Vision Encoder) and the textual stream, where it enriches targeted Object and Relationship queries. An LLM then deduces the final, offline World Scene Graph.
  • Figure 5: World Scene Graph Generation (WSGG). Left: Standard Video Scene Graph Generation (VidSGG) is anchored to the instantaneous camera view. Objects are localized as 2D bounding boxes in the image plane, and relationships are predicted only for currently detected entities. When an object exits the field of view or becomes occluded, it is silently dropped from the graph along with all of its relationships, resulting in an incomplete and temporally fragmented scene understanding. Right: Our proposed WSGG task grounds scene understanding in a persistent, global 3D world coordinate frame. Every object in the environment is represented by a 3D oriented bounding box (OBB) that persists across frames, regardless of whether the object is currently visible in the camera view. At each timestamp, the model outputs a complete world scene graph containing all known objects, both observed ($\mathcal{O}^{t}$) and unobserved ($\mathcal{U}^{t}$), together with their pairwise semantic relationships (attention, spatial, and contacting predicates). As illustrated at $t{=}30$s, objects that have left the camera's field of view (e.g., bed, laptop) remain precisely localized in 3D world coordinates, enabling the model to continue predicting meaningful relationships for them. Blue curves drawn from person to objects denote predicted relationships. This view-independent, temporally persistent representation supports downstream tasks such as embodied navigation, robotic manipulation, and long-horizon activity understanding that require reasoning about the full state of the world, not just its currently visible slice.
  • ...and 5 more figures
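
The asymmetric cross-attention that the Figure 3 caption attributes to MWAE (all tokens act as queries, but only visible tokens are attended to) can be illustrated with a small sketch. This version uses plain dot-product attention without learned query, key, or value projections, which is a simplification; the function name and the list-based token format are hypothetical.

```python
import math


def asymmetric_cross_attention(tokens, visible):
    """Every token queries, but only visible tokens serve as keys and values."""
    dim = len(tokens[0])
    keys = [tok for tok, vis in zip(tokens, visible) if vis]
    if not keys:
        raise ValueError("at least one visible token is required")
    outputs = []
    for query in tokens:
        # Scaled dot-product scores against the visible tokens only.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in keys]
        peak = max(scores)  # subtract the maximum for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w / total * key[d] for w, key in zip(weights, keys))
            for d in range(dim)
        ])
    return outputs


# The third token stands in for a masked (unobserved) object.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
visible = [True, True, False]
out = asymmetric_cross_attention(tokens, visible)
```

Because the masked token is never used as a key or value, its placeholder content cannot leak into the representations of other tokens; its own output is built purely from the visible ones.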