Temporal Context Consistency Above All: Enhancing Long-Term Anticipation by Learning and Enforcing Temporal Constraints

Alberto Maté, Mariella Dimiccoli

TL;DR

This work tackles long-term action anticipation from untrimmed video by predicting a sequence of future actions and their durations. It introduces TCCA, a transformer-based encoder-decoder that combines an LTContext encoder for strong past-context understanding, a parallel query-based decoder, and two novel components: the Bi-Directional Action Context Regularizer (BACR) and a CRF-based global temporal sequence optimizer. BACR enforces local coherence by aligning the past and future actions predicted for each segment with its neighboring segments, while the CRF, with a learnable transition matrix, models plausible action transitions and enables global sequence optimization via Viterbi-like inference. Across four benchmarks (Breakfast, 50Salads, EPIC-Kitchens-55, and EGTEA+), TCCA achieves state-of-the-art or highly competitive performance, often surpassing LLM-based and probabilistic baselines that rely on trimmed inputs and lack explicit duration modeling. The results indicate that combining local temporal coherence with global transition constraints yields more accurate and temporally consistent long-term forecasts, with practical implications for planning and real-time decision-making in dynamic environments.
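
To make the global optimization step concrete, the following is a minimal sketch of standard Viterbi decoding over per-segment class scores and a transition matrix. It is not the authors' implementation: the function name `viterbi_decode`, the list-based inputs, and the toy scores are assumptions chosen for illustration, and the paper describes its inference only as "Viterbi-like".

```python
from typing import List, Sequence


def viterbi_decode(
    emissions: Sequence[Sequence[float]],
    transitions: Sequence[Sequence[float]],
) -> List[int]:
    """Return the highest-scoring label sequence (hypothetical helper).

    emissions[t][c]   : score (e.g. a logit) of class c for segment t
    transitions[a][b] : score of moving from class a to class b
    """
    num_steps = len(emissions)
    num_classes = len(emissions[0])

    # best[c] = best score of any path that ends in class c at the current step
    best = list(emissions[0])
    backpointers: List[List[int]] = []

    for t in range(1, num_steps):
        new_best = []
        pointers = []
        for b in range(num_classes):
            # Choose the predecessor class that maximizes the path score into b.
            prev = max(range(num_classes), key=lambda a: best[a] + transitions[a][b])
            new_best.append(best[prev] + transitions[prev][b] + emissions[t][b])
            pointers.append(prev)
        best = new_best
        backpointers.append(pointers)

    # Backtrack from the best final class to recover the full path.
    last = max(range(num_classes), key=lambda c: best[c])
    path = [last]
    for pointers in reversed(backpointers):
        last = pointers[last]
        path.append(last)
    path.reverse()
    return path


# Toy example (assumed values): the middle segment weakly prefers class 1,
# but switching classes is heavily penalized by the transition scores.
emissions = [[2.0, 0.0], [0.4, 0.6], [2.0, 0.0]]
switch_penalty = [[0.0, -5.0], [-5.0, 0.0]]
no_penalty = [[0.0, 0.0], [0.0, 0.0]]

print(viterbi_decode(emissions, switch_penalty))  # [0, 0, 0]
print(viterbi_decode(emissions, no_penalty))      # [0, 1, 0], same as per-segment argmax
```

With zero transition scores the decoder reduces to an independent argmax per segment; the transition matrix is what lets an implausible switch in the middle of the sequence be overruled by its context.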

Abstract

This paper proposes a method for long-term action anticipation (LTA), the task of predicting action labels and their durations in a video given the observation of an initial untrimmed video interval. We build on an encoder-decoder architecture with parallel decoding and make two key contributions. First, we introduce a bi-directional action context regularizer module on top of the decoder that ensures temporal context coherence in temporally adjacent segments. Second, we learn from classified segments a transition matrix that models the probability of transitioning from one action to another, and we use it to optimize the sequence globally over the full prediction interval. In addition, we use an encoder specialized for action segmentation to increase the quality of the predictions in the observation interval at inference time, leading to a better understanding of the past. We validate our method on four benchmark datasets for LTA (EPIC-Kitchens-55, EGTEA+, 50Salads, and Breakfast), demonstrating performance superior or comparable to state-of-the-art methods, including probabilistic models and those based on Large Language Models, which assume trimmed video as input. The code will be released upon acceptance.
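
As an illustration of the output format described above (a sequence of action labels with durations), the sketch below expands such a prediction into frame-wise labels, the form in which anticipation results are typically scored. The helper name `expand_to_frames` and the convention that durations are non-negative relative proportions of the prediction horizon are assumptions, not details confirmed by the text.

```python
from typing import List, Sequence


def expand_to_frames(
    actions: Sequence[int],
    durations: Sequence[float],
    num_frames: int,
) -> List[int]:
    """Expand (action, duration) pairs into one label per frame.

    Assumes durations are non-negative relative proportions; they are
    renormalized so that the segments exactly fill num_frames frames.
    """
    total = sum(durations)
    if total <= 0:
        raise ValueError("durations must sum to a positive value")

    frames: List[int] = []
    cumulative = 0.0
    for action, duration in zip(actions, durations):
        cumulative += duration
        # Frame index at which this segment ends, clamped to the horizon.
        end = min(num_frames, round(cumulative / total * num_frames))
        frames.extend([action] * (end - len(frames)))
    return frames


# Three predicted segments covering 50%, 25%, and 25% of an 8-frame horizon.
print(expand_to_frames([3, 1, 4], [0.5, 0.25, 0.25], 8))
# [3, 3, 3, 3, 1, 1, 4, 4]
```

Working with cumulative end points rather than rounding each segment length separately keeps rounding errors from accumulating, so the output always has exactly `num_frames` labels.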

Paper Structure (36 sections, 12 equations, 7 figures, 10 tables)

Figures (7)

  • Figure 1: Overview of our method. Given the initial untrimmed portion of a video, LTA aims to predict future actions and their durations. Using the 'fried egg' activity from the Breakfast dataset as an example: (a) shows the results of FUTR; (b) shows our method, TCCA, with an enhanced Temporal Action Segmentation (TAS) encoder and a temporally consistent decoder, which yields better results.
  • Figure 2: Diagram of the TCCA architecture. Our method builds on an encoder-decoder structure. The LTContext encoder, trained with a temporal smoothing loss, processes the observed portion of the video, transforming the initial features $\bm{F}$ into action segmentation logits $\bm{F_{\text{seg}}}$ across $\bm{S}$ stages. These logits are used in the query-based decoder through cross-attention with the queries $\bm{Q}$. The decoded queries $\bm{Q^\prime}$ are classified by the Classifier & BACR module and further processed by the CRF layer to predict the future action sequence, using a transition matrix $\bm{M}$ that is learned during training. The right side details the Classifier, BACR, and CRF, where $\bm{Q^\prime}$ is projected into current ($\bm{a_{pres}^{i}}$), previous ($\bm{a_{past}^{i}}$), and next ($\bm{a_{fut}^{i}}$) action logits. The last two are trained with a KL loss against the current-action logits of the previous ($\bm{a_{pres}^{i-1}}$) and next ($\bm{a_{pres}^{i+1}}$) segments, respectively. The duration of the $i$-th segment depends in part on $\bm{a_{pres}^{i}}$ and $\bm{Q_i^\prime}$.
  • Figure 3: Qualitative results. We display the ground truth (GT), the results of TCCA (Ours), and the results of FUTR obtained with its official checkpoints. The left side of the diagram displays action segmentation on the observed portion, while the right side shows action anticipation after decoding the predicted actions and durations into a frame-wise sequence.
  • Figure 4: Impact of transition matrix initialization. We initialize $\bm{M}$ with either a pre-computed matrix based on statistical analysis (top-left) or a random matrix (bottom-left). The transition matrices learned from the pre-computed initialization (top-right) and from the random initialization (bottom-right) are remarkably similar.
  • Figure 5: Impact of Observation Horizon ($\alpha$) and Segmentation on anticipation in Breakfast. Results are shown for $\alpha \in \{0.1, 0.2, 0.3, 0.4, 0.5\}$. Segmentation is measured with Acc and Anticipation with MoC at $\beta=0.5$. Points indicate the mean; shaded areas represent the standard deviation across 4 splits.
  • ...and 2 more figures
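
The bi-directional regularization described in the Figure 2 caption can be sketched as follows. This is a simplified, assumption-laden illustration rather than the authors' code: the function names, the direction of the KL divergence (neighbor's current-action distribution as the target), and the plain averaging over terms are all choices made here for clarity.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def bacr_style_loss(
    present: Sequence[Sequence[float]],
    past: Sequence[Sequence[float]],
    future: Sequence[Sequence[float]],
) -> float:
    """Average KL between each segment's past/future heads and its neighbors.

    present[i], past[i], future[i] are the current-, previous-, and
    next-action logits predicted for segment i (hypothetical layout).
    """
    n = len(present)
    terms: List[float] = []
    for i in range(n):
        if i > 0:
            # The past head of segment i should match segment i-1's current action.
            terms.append(kl_divergence(softmax(present[i - 1]), softmax(past[i])))
        if i < n - 1:
            # The future head of segment i should match segment i+1's current action.
            terms.append(kl_divergence(softmax(present[i + 1]), softmax(future[i])))
    return sum(terms) / len(terms) if terms else 0.0


# Two segments, two classes: segment 0 is class 0, segment 1 is class 1.
present = [[2.0, 0.0], [0.0, 2.0]]
consistent_past = [[0.0, 0.0], [2.0, 0.0]]    # segment 1 correctly "remembers" class 0
consistent_future = [[0.0, 2.0], [0.0, 0.0]]  # segment 0 correctly "expects" class 1
inconsistent_past = [[0.0, 0.0], [0.0, 2.0]]  # segment 1 wrongly "remembers" class 1

print(bacr_style_loss(present, consistent_past, consistent_future))
print(bacr_style_loss(present, inconsistent_past, consistent_future))
```

The first call yields a loss of zero because every past and future head agrees with its neighbor, while the second is strictly positive. Boundary segments contribute only the term for the neighbor that exists.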