FORESCENE: FOREcasting human activity via latent SCENE graphs diffusion

Antonio Alliegro, Francesca Pistilli, Tatiana Tommasi, Giuseppe Averta

TL;DR

FORESCENE tackles Scene Graph Anticipation by forecasting complete, dynamic scene graphs that allow objects to appear or disappear over time. It combines a Graph Auto-Encoder to encode observed graphs into a latent space with a Latent Diffusion Model to generate future graph latents, which are decoded back into objects and relations. This approach enables multiple plausible futures and outperforms prior SGA methods on Action Genome, especially under object distribution shifts. The work advances practical human–environment understanding by removing the constraint of object continuity and providing a scalable, diffusion-based framework for structured scene forecasting.

Abstract

Forecasting human-environment interactions in daily activities is challenging due to the high variability of human behavior. While predicting directly from videos is possible, it is limited by confounding factors like irrelevant objects or background noise that do not contribute to the interaction. A promising alternative is using Scene Graphs (SGs) to track only the relevant elements. However, current methods for forecasting future SGs face significant challenges and often rely on unrealistic assumptions, such as fixed objects over time, limiting their applicability to long-term activities where interacted objects may appear or disappear. In this paper, we introduce FORESCENE, a novel framework for Scene Graph Anticipation (SGA) that predicts both object and relationship evolution over time. FORESCENE encodes observed video segments into a latent representation using a tailored Graph Auto-Encoder and forecasts future SGs using a Latent Diffusion Model (LDM). Our approach enables continuous prediction of interaction dynamics without making assumptions on the graph's content or structure. We evaluate FORESCENE on the Action Genome dataset, where it outperforms existing SGA methods while solving a significantly more complex task.
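
The encode, forecast, decode flow described above can be illustrated with a minimal data-flow sketch. Everything in it is an assumption made for illustration: the linear `encode`/`decode` maps stand in for the Graph Auto-Encoder, `toy_denoiser` is an untrained placeholder for the learned noise predictor, the DDPM-style ancestral sampler is one common way to sample from a diffusion model and is not confirmed as the paper's sampler, and the dimensions and noise schedule are arbitrary. Unlike the actual decoder, this toy version outputs fixed-size vectors rather than graphs with a variable set of objects.

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT_DIM = 6      # hypothetical size of a flattened per-frame graph feature
LATENT_DIM = 8    # hypothetical latent size
NUM_STEPS = 50    # hypothetical number of reverse-diffusion steps


def encode(graph_features, w_enc):
    """Toy stand-in for the Graph Encoder: one linear map per frame."""
    return graph_features @ w_enc


def decode(latents, w_dec):
    """Toy stand-in for the Graph Decoder: one linear map per frame."""
    return latents @ w_dec


def toy_denoiser(z_t, t, context):
    """Untrained placeholder for the learned noise predictor.

    It treats the offset from the mean observed latent as the noise estimate;
    the timestep t is accepted only to mirror a real denoiser's interface.
    """
    return z_t - context.mean(axis=0, keepdims=True)


def anticipate(observed_latents, num_future, betas):
    """DDPM-style ancestral sampling of future latents, conditioned on observed ones."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    z = rng.standard_normal((num_future, observed_latents.shape[1]))
    for t in reversed(range(len(betas))):
        eps_hat = toy_denoiser(z, t, observed_latents)
        mean = (z - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        # No noise is added at the final step (t == 0).
        noise = rng.standard_normal(z.shape) if t > 0 else np.zeros_like(z)
        z = mean + np.sqrt(betas[t]) * noise
    return z


# Four observed frames, three future frames to anticipate (arbitrary numbers).
observed = rng.standard_normal((4, FEAT_DIM))
w_enc = rng.standard_normal((FEAT_DIM, LATENT_DIM)) * 0.1
w_dec = rng.standard_normal((LATENT_DIM, FEAT_DIM)) * 0.1
betas = np.linspace(1e-4, 0.02, NUM_STEPS)

observed_latents = encode(observed, w_enc)
future_latents = anticipate(observed_latents, num_future=3, betas=betas)
future_graphs = decode(future_latents, w_dec)
print(future_graphs.shape)
```

Because sampling starts from fresh Gaussian noise, calling `anticipate` again with the same observed latents gives a different sample, which is how a diffusion-based forecaster can produce multiple plausible futures.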

Paper Structure

This paper contains 22 sections, 14 equations, 4 figures, 6 tables, 1 algorithm.

Figures (4)

  • Figure 1: Scene graphs for human-environment interactions represent actors and objects as nodes, and relationships as edges. Dotted lines indicate predictions, while solid lines denote fixed elements. A model forecasting these interactions should predict the entire graph -- both nodes and edges -- allowing objects to appear and disappear over time. Unlike other methods, which only update relationships while keeping object nodes fixed, FORESCENE achieves complete (nodes+edges) graph forecasting.
  • Figure 2: Overview of the proposed method at inference time to solve the task of Scene Graph Anticipation. Observed frames $\{F_0, \dots, F_s\}$ are encoded into latent representations using the Graph Encoder. Latent codes are then fed into the diffusion model as conditioning input to anticipate the future unseen latents. The anticipated latents are finally transformed back by the decoder into a sequence of complete (objs + rels) scene graphs $\{F_{s+1}, \dots, F_\text{last}\}$.
  • Figure 3: (Left) Empirical cumulative distribution of the difficulty scores for the top-3 highest-scoring (most difficult in terms of $J_\text{dist}$) splits per test video. To construct the SGA under Object Distribution Shift scenarios, we select splits with difficulty scores in the ranges $[0.33, 0.66)$ (MID) and $[0.66, 1]$ (HARD). (Middle and Right) Distribution of observed video portions corresponding to anticipation splits in the MID and HARD settings.
  • Figure 4: Qualitative comparison of Scene Graph Anticipation between FORESCENE and the top-performing competitor, SceneSayerSDE (Peddi et al., 2024), with the observed video portion ($\mathcal{F}$) set to 0.3. The examples highlight a common scenario where the object continuity assumption of previous SGA methods breaks down, hindering their applicability in real-world scenarios. In contrast, FORESCENE accurately forecasts both the appearance and disappearance of objects and their relationships over time.
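
The MID and HARD ranges in the Figure 3 caption can be made concrete with a small sketch. It assumes that $J_\text{dist}$ is the Jaccard distance between the set of objects in the observed portion and the set of objects in the future portion of a split; the caption does not spell this out, so treat that reading, along with the function names and the example object sets, as hypothetical. Only the two score ranges come from the caption.

```python
def jaccard_distance(observed_objects, future_objects):
    """Jaccard distance between two object sets: 1 - |A ∩ B| / |A ∪ B|."""
    a, b = set(observed_objects), set(future_objects)
    union = a | b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(a & b) / len(union)


def difficulty_bucket(score):
    """Map a difficulty score to the ranges named in the Figure 3 caption."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("difficulty score must lie in [0, 1]")
    if score >= 0.66:
        return "HARD"   # [0.66, 1]
    if score >= 0.33:
        return "MID"    # [0.33, 0.66)
    return None         # below 0.33: not used for the shift scenarios


# Hypothetical splits: objects seen before vs. after the anticipation point.
mid_score = jaccard_distance({"person", "cup", "table"},
                             {"person", "table", "laptop", "phone"})
hard_score = jaccard_distance({"person", "cup"},
                              {"person", "laptop", "phone", "bed"})
print(mid_score, difficulty_bucket(mid_score))
print(hard_score, difficulty_bucket(hard_score))
```

Under this reading, a higher score means that fewer objects are shared between the observed and future portions, which is exactly the case where a fixed-object assumption fails.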