Towards Scene Graph Anticipation

Rohith Peddi, Saksham Singh, Saurabh, Parag Singla, Vibhav Gogate

TL;DR

This paper introduces Scene Graph Anticipation (SGA), the task of forecasting future fine-grained relationships between objects in video scene graphs. It presents SceneSayer, a continuous-time framework built from an Object Representation Processing Unit (ORPU), a Spatial Context Processing Unit (SCPU), and a Latent Dynamics Processing Unit (LDPU), which models the evolution of object interactions with NeuralODEs and NeuralSDEs. In experiments on Action Genome, SceneSayer (especially the SDE variant) yields significant gains in long-horizon relation anticipation across the AGS, PGAGS, and GAGS settings, and ablations highlight the benefits of stochastic dynamics modeling and the Stratonovich interpretation. The work advances anticipatory scene understanding, with potential impact on video surveillance, robotics, and autonomous systems, by providing robust, uncertainty-aware relational forecasts beyond 30 seconds into the future.

Abstract

Spatio-temporal scene graphs represent interactions in a video by decomposing scenes into individual objects and their pair-wise temporal relationships. Long-term anticipation of the fine-grained pair-wise relationships between objects is a challenging problem. To this end, we introduce the task of Scene Graph Anticipation (SGA). We adapt state-of-the-art scene graph generation methods as baselines to anticipate future pair-wise relationships between objects and propose a novel approach, SceneSayer. In SceneSayer, we leverage object-centric representations of relationships to reason about the observed video frames and model the evolution of relationships between objects. We take a continuous-time perspective and model the latent dynamics of the evolution of object interactions using the concepts of NeuralODE and NeuralSDE. We infer representations of future relationships by solving an Ordinary Differential Equation or a Stochastic Differential Equation, respectively. Extensive experimentation on the Action Genome dataset validates the efficacy of the proposed methods.
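
To make the continuous-time idea concrete, the following is a minimal sketch of how a latent relationship representation could be rolled forward by numerically solving an ODE or an SDE. It is not the SceneSayer implementation: `drift` is a hypothetical toy function standing in for a learned network, the fixed-step Euler and Euler-Maruyama schemes are chosen only for brevity, and the diffusion `sigma` is assumed constant. With a constant diffusion term the Itô and Stratonovich readings coincide, so the sketch sidesteps that choice.

```python
import math
import random


def drift(z):
    # Hypothetical toy drift (an assumption, not a learned model):
    # each latent coordinate decays toward zero.
    return [-0.5 * zi for zi in z]


def euler_ode(z0, t0, t1, steps):
    """Integrate dz/dt = drift(z) from t0 to t1 with fixed-step Euler."""
    dt = (t1 - t0) / steps
    z = list(z0)
    for _ in range(steps):
        f = drift(z)
        z = [zi + dt * fi for zi, fi in zip(z, f)]
    return z


def euler_maruyama_sde(z0, t0, t1, steps, sigma=0.1, rng=None):
    """Integrate dz = drift(z) dt + sigma dW with Euler-Maruyama.

    sigma is a constant (additive-noise) diffusion, assumed for illustration.
    """
    if rng is None:
        rng = random.Random(0)
    dt = (t1 - t0) / steps
    z = list(z0)
    for _ in range(steps):
        f = drift(z)
        # Brownian increment over dt is Gaussian with std sqrt(dt).
        z = [
            zi + dt * fi + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
            for zi, fi in zip(z, f)
        ]
    return z


z_last_observed = [1.0, -2.0]  # toy latent for one subject-object pair
z_future_ode = euler_ode(z_last_observed, 0.0, 1.0, steps=1000)
z_future_sde = euler_maruyama_sde(z_last_observed, 0.0, 1.0, steps=1000)
```

The ODE solve returns a single deterministic future latent, while repeated SDE solves with different random generators give a spread of plausible futures, which is one way to read the uncertainty-aware forecasts mentioned in the TL;DR.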

Paper Structure

This paper contains 50 sections, 26 equations, 7 figures, 7 tables, 4 algorithms.

Figures (7)

  • Figure 1: Task Description. We contrast the task of Video Scene Graph Generation (VidSGG) on the left with the proposed task of Scene Graph Anticipation (SGA) on the right. VidSGG entails the identification of relationships from the observed data, such as (Person, looking_at, Floor) and (Person, not_contacting, Cup). SGA aims to anticipate the evolution of these relationships to (Person, touching, Cup), and eventually, (Person, drinking_from, Cup).
  • Figure 2: Overview of SceneSayer. The forward pass of SceneSayer begins with ORPU, where initial object proposals are generated for each frame. These proposals are then fed to a temporal encoder to ensure that the object representations remain consistent over time. Next, in SCPU, we construct initial relationship representations by concatenating the representations of interacting objects. These representations are further refined using a spatial encoder, embedding the scene's spatial context into these relationship representations. Then, the representations undergo further enhancement in LDPU, where another temporal encoder fine-tunes them, imbuing the data with comprehensive spatio-temporal scene knowledge. These refined relationship representations from the final observed frame are then input to a Latent Dynamics Model (LDM), where a generative model, either a NeuralODE or a NeuralSDE, generates relationship representations of interacting objects in future frames by solving the corresponding differential equations. Finally, these future representations are decoded into relationship predicates to construct anticipated scene graphs.
  • Figure 3: Overview of Baselines. In our proposed Variant 1 (shown to the left), we input relationship representations to an anticipatory transformer to generate relationship representations for future frames auto-regressively. A predicate classification network (MLP) is then employed to decode these anticipated relationship representations. Meanwhile, in Variant 2 (shown to the right), we enhance relationship representations by passing them through a temporal encoder. These representations are fed to an auto-regressive anticipatory transformer to anticipate future relationship representations. Here, we employ two predicate classification heads (MLPs): one decodes the observed relationship representations, and the other decodes the anticipated relationship representations. In both variants proposed, the auto-regressive anticipatory transformer acts as the generative model, predicting the evolution of relationships between the interacting objects.
  • Figure 4: Qualitative Results. To the left, we show a sampled subset of the frames observed by the models. The second column provides a ground truth scene graph corresponding to a future frame. In the subsequent columns, we contrast the performance of baseline variants with the proposed SceneSayer models. In each graph, correctly anticipated relationships are shown in black text and incorrectly anticipated relationships are highlighted in red.
  • Figure 5: The STTran+ and DSGDetr+ models differ primarily in their ORPU; thereafter, they follow a unified pipeline. Differences in ORPU: (1) In STTran+, the ORPU begins with extracting features using a pre-trained object detector. These features are then processed in the SCPU to construct relationship representations. (2) Conversely, DSGDetr+'s ORPU starts with the initial bounding box and class label detection. Object tracking is established using the Hungarian Matching algorithm, and these tracks are passed through a transformer encoder to construct temporally consistent object representations. Unified Pipeline Post-ORPU: The relationship representations constructed from the concatenation of object representations from each model's unique ORPU are passed through a transformer encoder to aggregate information from the spatial context. This is followed by processing in the LDPU, where we employ an auto-regressive transformer to predict relationship representations of interacting objects in future frames. Finally, these representations are decoded for predicate classification.
  • ...and 2 more figures
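
The auto-regressive anticipation loop described for the baselines in Figure 3 can be sketched as follows. This is an illustrative toy under stated assumptions, not the authors' code: `extrapolate` is a hypothetical stand-in for the anticipatory transformer, `decode` is a hypothetical linear head standing in for the predicate classification MLP, and the two-dimensional representations and identity weights are made up. Only the predicate names `not_contacting` and `touching` come from the Figure 1 caption.

```python
def rollout(history, step_fn, num_future):
    """Auto-regressively extend a sequence of relationship representations."""
    seq = list(history)
    future = []
    for _ in range(num_future):
        nxt = step_fn(seq)   # predict the next representation from all so far
        future.append(nxt)
        seq.append(nxt)      # feed the prediction back in as input
    return future


def extrapolate(seq):
    # Hypothetical stand-in for the anticipatory transformer:
    # linear extrapolation from the last two representations.
    last, prev = seq[-1], seq[-2]
    return [2 * a - b for a, b in zip(last, prev)]


def decode(rep, predicates, weights):
    """Toy predicate head: linear scores, then argmax (first index on ties)."""
    scores = [sum(w * x for w, x in zip(row, rep)) for row in weights]
    best = max(range(len(scores)), key=scores.__getitem__)
    return predicates[best]


predicates = ["not_contacting", "touching"]
weights = [[1.0, 0.0], [0.0, 1.0]]      # identity weights, assumed for the toy
observed = [[1.0, 0.0], [0.75, 0.25]]   # representations for (Person, ?, Cup)

future = rollout(observed, extrapolate, num_future=3)
labels = [decode(rep, predicates, weights) for rep in future]
print(labels)
```

In this toy run the anticipated predicate shifts from `not_contacting` to `touching` as the rollout proceeds, mirroring the kind of evolution Figure 1 describes. The same loop structure applies whichever model fills the role of `step_fn`.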