MANTA: Diffusion Mamba for Efficient and Effective Stochastic Long-Term Dense Anticipation

Olga Zatsarynna, Emad Bahrami, Yazan Abu Farha, Gianpiero Francesca, Juergen Gall

TL;DR

MANTA tackles stochastic long-term dense action anticipation by combining diffusion with a Bidirectional Selective State-Space Layer (BSSL) built on Mamba blocks. The architecture keeps a global receptive field while processing observed and future (masked) frames in a data-dependent, selective way. It achieves state-of-the-art accuracy on Breakfast, Assembly101, and 50Salads, with up to 65.3x faster inference and 6.6x faster training than prior methods such as GTDA, and it uses fewer parameters thanks to its state-space design. The results show that long-range temporal modelling can be both effective and computationally efficient for minutes-long anticipation tasks.

Abstract

Long-term dense action anticipation is challenging because it requires predicting actions and their durations several minutes into the future from the provided video observations. To model the uncertainty of future outcomes, stochastic models predict several potential future action sequences for the same observation. Recent work has further proposed to model uncertainty for the observed frames as well, by predicting per-frame past and future actions jointly in a unified manner. While such joint modelling of actions is beneficial, it requires long-range temporal capabilities to connect events across distant past and future time points. Previous work struggles to achieve this long-range understanding because its receptive field is limited and/or sparse. To alleviate this issue, we propose a novel MANTA (MAmba for ANTicipation) network. Our model enables effective long-term temporal modelling even for very long sequences while maintaining linear complexity in sequence length. We demonstrate that our approach achieves state-of-the-art results on three datasets (Breakfast, 50Salads, and Assembly101) while also significantly improving computational and memory efficiency. Our code is available at https://github.com/olga-zats/DIFF_MANTA .
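
To make the "global receptive field at linear cost" claim concrete, here is a minimal sketch of a bidirectional gated scan. It is a toy scalar recurrence, not the authors' BSSL and not a Mamba block: the function names, the update rule `h[t] = g[t] * h[t-1] + (1 - g[t]) * x[t]`, and the choice to sum the two directions are all assumptions made for illustration. In a selective model the gates would be computed from the input; here they are simply passed in.

```python
def scan(xs, gates):
    """One left-to-right pass of a gated recurrence (hypothetical toy rule).

    h[t] = g[t] * h[t-1] + (1 - g[t]) * x[t], so the cost is O(T).
    """
    h = 0.0
    out = []
    for x, g in zip(xs, gates):
        h = g * h + (1.0 - g) * x
        out.append(h)
    return out


def bidirectional_scan(xs, gates):
    """Run the scan forward and backward, then sum the two results.

    Summation is an assumption; it is only one possible way to merge
    the two traversal directions.
    """
    forward = scan(xs, gates)
    # Reverse the sequence, scan, and flip back to the original order.
    backward = scan(xs[::-1], gates[::-1])[::-1]
    return [f + b for f, b in zip(forward, backward)]


# An impulse at the last frame still influences the first frame,
# because the backward pass carries it across the whole sequence.
impulse_at_end = [0.0, 0.0, 0.0, 1.0]
constant_gates = [0.5, 0.5, 0.5, 0.5]
mixed = bidirectional_scan(impulse_at_end, constant_gates)
```

Each direction visits every frame once, so the total work grows linearly with sequence length, while every output position can still depend on every input position.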

Paper Structure (23 sections, 10 equations, 10 figures, 11 tables)

Figures (10)

  • Figure 1: We propose a novel MANTA diffusion generator that allows for more efficient and effective stochastic long-term anticipation compared to the previous work.
  • Figure 2: (Top) Overview of the proposed MANTA model. Given a noise vector $\mathcal{Y}_t$ and a conditioning vector $\mathcal{X}$, constructed by extending the features of the $P$ observed frames with zero padding in place of the $F$ future frames, we concatenate and forward them through our proposed MANTA model. As output, MANTA predicts action classes for both observed and future frames. (Bottom) Illustration of the structure of the (a) MANTA block, (b) Bidirectional State-Space Layer, as well as (c) traversal directions of the BSSL for the temporal input sequence.
  • Figure 3: Qualitative comparison of MANTA and GTDA on Breakfast. Best viewed zoomed in.
  • Figure 4: (Left) Mean time for generating 25 samples measured for different models on Breakfast, with all models performing $50$ inference diffusion steps; (Right) Mean time required for training different models for one epoch on Breakfast. (Both) The batch size is equal across models, and evaluation/training was conducted on the same GPU.
  • Figure 5: Comparison of Average Mean MoC accuracy for different models on videos grouped by their duration on Breakfast for $\alpha=0.3$ and $\beta=0.2$.
  • ...and 5 more figures
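
The input construction described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `build_inputs`, the noise dimension (taken here to equal the number of action classes), and the concatenation order (noise first, then conditioning) are all hypothetical choices that the caption does not specify.

```python
import random


def build_inputs(observed, num_future, num_classes, seed=0):
    """Build per-frame model inputs from P observed feature vectors.

    observed:    list of P feature vectors (each a list of floats)
    num_future:  F, the number of future frames to anticipate
    num_classes: assumed dimension of the per-frame noise vector
    """
    rng = random.Random(seed)
    feat_dim = len(observed[0])
    total_frames = len(observed) + num_future

    # Conditioning X: observed features, then zero padding in place
    # of the F future frames.
    conditioning = [list(frame) for frame in observed]
    conditioning += [[0.0] * feat_dim for _ in range(num_future)]

    # Noise Y_t: one Gaussian vector per frame, observed and future alike.
    noise = [
        [rng.gauss(0.0, 1.0) for _ in range(num_classes)]
        for _ in range(total_frames)
    ]

    # Concatenate along the channel axis, frame by frame
    # (the order is an assumption).
    return [n + c for n, c in zip(noise, conditioning)]


# Example: P = 3 observed frames with 2-dimensional features, F = 2.
observed_features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
model_input = build_inputs(observed_features, num_future=2, num_classes=4)
```

The result has P + F rows, so a model consuming it can predict action classes for both observed and future frames in a single pass, as the caption describes.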