Gated Temporal Diffusion for Stochastic Long-Term Dense Anticipation

Olga Zatsarynna, Emad Bahrami, Yazan Abu Farha, Gianpiero Francesca, Juergen Gall

TL;DR

This work introduces Gated Temporal Diffusion (GTD) to tackle stochastic long-term dense action anticipation by jointly modeling uncertainty in both observed frames and future predictions. The GTD framework uses a novel Gated Anticipation Network (GTAN) as the generator, which employs gated temporal convolutions to differentiate and fuse information from past observations and future frames within a single diffusion process. Training optimizes a diffusion-based objective, with self-conditioning and stage-wise outputs enabling multiple diverse future sequences; during inference, multiple samples are produced via DDIM sampling. Empirically, GTD achieves state-of-the-art results on Breakfast, Assembly101, and 50Salads in both stochastic and deterministic settings, demonstrating the value of jointly handling observation and future uncertainty for robust long-horizon anticipation. The approach also reveals limitations in inference speed, suggesting avenues like distillation or caching-based optimizations for real-time deployment.

Abstract

Long-term action anticipation has become an important task for many applications, such as autonomous driving and human-robot interaction. Unlike short-term anticipation, predicting more actions into the future poses a real challenge because uncertainty increases over longer horizons. While there has been significant progress in predicting more actions into the future, most proposed methods address the task in a deterministic setup and ignore the underlying uncertainty. In this paper, we propose a novel Gated Temporal Diffusion (GTD) network that models the uncertainty of both the observation and the future predictions. As the generator, we introduce a Gated Anticipation Network (GTAN) that models both observed and unobserved frames of a video in a mutual representation. On the one hand, using a mutual representation for past and future allows us to jointly model ambiguities in the observation and the future; on the other hand, GTAN can by design treat the observed and unobserved parts differently and steer the information flow between them. Our model achieves state-of-the-art results on the Breakfast, Assembly101, and 50Salads datasets in both stochastic and deterministic settings. Code: https://github.com/olga-zats/GTDA
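
To make the idea of steering information flow with gates more concrete, the following is a minimal NumPy sketch of a generic dilated gated temporal convolution: one convolution produces features, a second produces per-frame gates in (0, 1), and the two are multiplied elementwise. This is not the authors' GTAN layer. The function name `gated_dilated_conv1d`, the weight shapes, and the zero padding are assumptions made for illustration only.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, maps gate logits to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def gated_dilated_conv1d(x, w_feat, w_gate, dilation):
    """Generic gated temporal convolution (illustrative, not the paper's layer).

    x:       (T, C_in) sequence covering observed and future frames
    w_feat:  (K, C_in, C_out) weights of the feature branch, K odd
    w_gate:  (K, C_in, C_out) weights of the gate branch
    Returns the gated features and the gates, both of shape (T, C_out).
    """
    k = w_feat.shape[0]
    t_len = x.shape[0]
    pad = dilation * (k - 1) // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))  # zero-pad along time only

    feat = np.zeros((t_len, w_feat.shape[2]))
    gate_logit = np.zeros_like(feat)
    for i in range(k):
        # Tap i looks at frames shifted by (i - (k - 1) / 2) * dilation.
        window = xp[i * dilation : i * dilation + t_len]
        feat += window @ w_feat[i]
        gate_logit += window @ w_gate[i]

    gate = sigmoid(gate_logit)
    # Gates close to 0 suppress a feature at that frame; gates close to 1 pass it.
    return feat * gate, gate
```

In a trained network, gate values near zero at particular frames would switch the corresponding features off, which is one way a model can treat observed and unobserved frames differently within a shared sequence.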

Paper Structure (33 sections, 13 equations, 20 figures, 14 tables)

Figures (20)

  • Figure 1: The proposed Gated Temporal Diffusion (GTD) model generates multiple future long-term predictions of actions from a single partially observed video. In contrast to previous works, it models the uncertainty of both the observation and the future. In this example, the light conditions make it difficult to distinguish if a bun or an orange is cut. This ambiguity is reflected in the predicted samples where the uncertainty of the past impacts the predicted future.
  • Figure 1: Visualization of gates from the GTAN. Both outputs are taken from the second stage from layers $l=2$ (left) and $l=8$ (right). The vertical green line marks the boundary between the observed and future frames.
  • Figure 2: We formulate stochastic action anticipation as a diffusion process where the initial input consists of Gaussian noise, $Y_T$, and zero-padded features, $\tilde{F}$. Given the inputs, the GTAN generator predicts the denoised action labels, $\hat{Y}_{0, T}$. From step $T-1$ to $0$, the GTAN generator uses self-conditioning by taking the previous denoised action labels as additional input. The noise $\hat{\epsilon}_t$ and mean $\hat{\mu}_{t}$ terms for steps $T-1$ to $0$ are computed using equations \ref{eq:eps_t} and \ref{eq:mu_t}. $\oplus$ indicates channel-wise concatenation.
  • Figure 2: MFSS of GTD future predictions for sequences sorted by MFSS diversity for the observation part on Breakfast.
  • Figure 3: The GTAN takes as input a joint representation sequence for observed and future frames. Each stage consists of GTA blocks. The dilated gated convolutions de-activate features at certain frames.
  • ...and 15 more figures
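
The sampling procedure described in the caption that formulates anticipation as a diffusion process (Gaussian noise $Y_T$, zero-padded features $\tilde{F}$, channel-wise concatenation, self-conditioning) can be sketched roughly as follows. This is a hedged illustration, not the authors' code: `toy_denoiser` is a hypothetical stand-in for the GTAN generator, the concatenation order is an assumption, and the update is the standard deterministic DDIM step ($\eta = 0$) rather than the paper's own equations, which are not reproduced on this page.

```python
import numpy as np

def build_input(obs_feats, n_future, y_t, y0_prev):
    """Zero-pad observed features over the future frames, then concatenate
    noisy labels, padded features, and the previous estimate channel-wise."""
    d = obs_feats.shape[1]
    f_tilde = np.concatenate([obs_feats, np.zeros((n_future, d))], axis=0)
    return np.concatenate([y_t, f_tilde, y0_prev], axis=1)

def toy_denoiser(x, n_classes):
    """Hypothetical stand-in for the generator: maps the input to a
    per-frame label estimate in [-1, 1]. A real model would be learned."""
    return np.tanh(x[:, :n_classes])

def ddim_sample(obs_feats, n_future, n_classes, alpha_bar, rng):
    """Deterministic DDIM loop (eta = 0) with self-conditioning.

    alpha_bar[t] is the cumulative signal level at step t, strictly inside
    (0, 1) and decreasing in t, so the last index is the noisiest step.
    """
    n_frames = obs_feats.shape[0] + n_future
    y_t = rng.standard_normal((n_frames, n_classes))  # Y_T: pure Gaussian noise
    y0_hat = np.zeros_like(y_t)                       # no estimate before the first step
    for t in range(len(alpha_bar) - 1, -1, -1):
        x = build_input(obs_feats, n_future, y_t, y0_hat)  # self-conditioning input
        y0_hat = toy_denoiser(x, n_classes)
        a_t = alpha_bar[t]
        a_prev = alpha_bar[t - 1] if t > 0 else 1.0
        # Noise implied by the current sample and the denoised estimate.
        eps_hat = (y_t - np.sqrt(a_t) * y0_hat) / np.sqrt(1.0 - a_t)
        # Move to the previous (less noisy) step.
        y_t = np.sqrt(a_prev) * y0_hat + np.sqrt(1.0 - a_prev) * eps_hat
    return y_t
```

Calling `ddim_sample` several times with different random generators yields different label sequences for the same observation, which is how a diffusion model of this kind produces multiple candidate futures.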