Action-Guided Attention for Video Action Anticipation

Tsung-Ming Tai, Sofia Casarin, Andrea Pilzer, Werner Nutt, Oswald Lanz

TL;DR

The paper proposes Action-Guided Attention (AGA), an attention mechanism that uses predicted action sequences as queries and keys to guide sequence modeling, making its anticipative predictions transparent and interpretable.

Abstract

Anticipating future actions in videos is challenging, as the observed frames provide only evidence of past activities, requiring the inference of latent intentions to predict upcoming actions. Existing transformer-based approaches, which rely on dot-product attention over pixel representations, often lack the high-level semantics necessary to model video sequences for effective action anticipation. As a result, these methods tend to overfit to explicit visual cues present in the past frames, limiting their ability to capture underlying intentions and degrading generalization to unseen samples. To address this, we propose Action-Guided Attention (AGA), an attention mechanism that explicitly leverages predicted action sequences as queries and keys to guide sequence modeling. Our approach encourages the attention module to emphasize relevant moments from the past based on the upcoming activity, and combines this information with the current frame embedding via a dedicated gating function. The design of AGA enables post-training analysis of the knowledge discovered from the training set. Experiments on the widely adopted EPIC-Kitchens-100 benchmark demonstrate that AGA generalizes well from validation to unseen test sets. Post-training analysis can further examine the action dependencies captured by the model and the counterfactual evidence it has internalized, offering transparent and interpretable insights into its anticipative predictions.

Paper Structure

This paper contains 29 sections, 12 equations, 11 figures, 13 tables.

Figures (11)

  • Figure 1: Architecture Overview. The model consists of two modules. The Action-Guided Attention uses the most recent $S$ action predictions as keys, the exponential moving average (EMA) of all predicted actions as the query, and $S$ frame embeddings as values to generate a history context $\tilde{h}_t$. The Adaptive Gating then integrates this history context with the current frame embedding $e_t$ to produce a fused representation, which is mapped to the new prediction $\hat{y}_t$.
  • Figure 2: Forward Analysis. Given a query, the analysis identifies which past actions the model attends to when predicting its next action. This analysis was conducted on the model trained on EK100.
  • Figure 3: Backward Analysis. The figure compares the top-5 original predicted actions with the counterfactual supportive actions optimized toward the target action "take pan". Each column represents a timestep; the final column shows the anticipated output. Suppressed actions are highlighted in red and promoted actions appear in green. This example is drawn from the EK100 validation set.
  • Figure 4: Visualization of Adaptive Gating Ratio. The gating values, displayed alongside the action sequence, demonstrate context-aware behavior. In background regions (black), the gate retains historical context; in action regions (colored), it prioritizes current visual evidence.
  • Figure 5: Robustness analysis of AGA against errors occurring in frame prediction.
  • ...and 6 more figures
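
The data flow described in the Figure 1 caption can be sketched in NumPy as follows. This is only an illustrative sketch, not the authors' implementation: the caption states which tensors act as query, keys, and values, but the projection matrices `W_q`, `W_k`, `W_g`, `W_o`, the scaled dot-product scoring, the sigmoid form of the gate, and the EMA decay of 0.9 are all assumptions introduced here, as are the function names `init_params`, `aga_step`, and `update_ema`.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(rng, num_actions, embed_dim, attn_dim):
    # Hypothetical parameters; the paper's actual layers are not shown in this summary.
    s = 0.1
    return {
        "W_q": s * rng.standard_normal((num_actions, attn_dim)),
        "W_k": s * rng.standard_normal((num_actions, attn_dim)),
        "W_g": s * rng.standard_normal((2 * embed_dim, embed_dim)),
        "W_o": s * rng.standard_normal((embed_dim, num_actions)),
    }

def aga_step(past_preds, ema_pred, past_embeds, e_t, p):
    """One anticipation step.

    past_preds:  (S, C) most recent S action predictions, used as keys
    ema_pred:    (C,)   EMA of all predicted actions, used as the query
    past_embeds: (S, D) the S frame embeddings, used as values
    e_t:         (D,)   current frame embedding
    """
    q = ema_pred @ p["W_q"]                       # (A,)
    k = past_preds @ p["W_k"]                     # (S, A)
    attn = softmax(k @ q / np.sqrt(q.shape[0]))   # (S,) weights over past steps
    h_tilde = attn @ past_embeds                  # (D,) history context
    # Assumed gate: elementwise sigmoid over the concatenated context and frame.
    gate = sigmoid(np.concatenate([h_tilde, e_t]) @ p["W_g"])  # (D,)
    fused = gate * h_tilde + (1.0 - gate) * e_t   # high gate keeps history
    y_hat = softmax(fused @ p["W_o"])             # (C,) new prediction
    return y_hat, attn, gate

def update_ema(ema_pred, y_hat, decay=0.9):
    # Assumed EMA update; the decay value is illustrative.
    return decay * ema_pred + (1.0 - decay) * y_hat

rng = np.random.default_rng(0)
S, C, D, A = 4, 6, 8, 5                           # toy sizes
p = init_params(rng, C, D, A)
past_preds = np.full((S, C), 1.0 / C)             # uniform past predictions
ema_pred = np.full(C, 1.0 / C)
past_embeds = rng.standard_normal((S, D))
e_t = rng.standard_normal(D)

y_hat, attn, gate = aga_step(past_preds, ema_pred, past_embeds, e_t, p)
ema_pred = update_ema(ema_pred, y_hat)
print(attn.shape, gate.shape, y_hat.shape)
```

Because the attention scores in this sketch depend only on action predictions rather than pixel features, the `attn` vector can be read directly as "which past actions mattered for this query", which is the kind of quantity the forward analysis in Figure 2 inspects. Under the sketch's convention, a gate value near 1 retains the history context and a value near 0 favors the current frame embedding.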