Semantically Guided Representation Learning For Action Anticipation

Anxhelo Diko, Danilo Avola, Bardh Prenkaj, Federico Fontana, Luigi Cinque

TL;DR

This work tackles action anticipation under future uncertainty by introducing S-GEAR, a framework that learns visual action prototypes and encodes semantic interconnections derived from language. It combines a ViT-based visual encoder, a Temporal Context Aggregator, a Prototype Attention module, and a causal Transformer decoder, guided by language-derived prototypes $\rho_\ell$ mapped into a common space with visual prototypes $\rho_\nu$ via relative representations. Training optimizes a multi-term objective $\mathcal{L}_{tot} = \lambda_1 \mathcal{L}_{Sem} + \lambda_2 \mathcal{L}_{Cls} + \lambda_3 \mathcal{L}_{Past} + \lambda_4 \mathcal{L}_{Feat}$, supplemented by a regularization term $\mathcal{L}_{reg}$ that aligns visual prototypes with their corresponding actions. Empirically, S-GEAR improves over previous works on EK55, EK100, EGTEA Gaze+, and 50 Salads, indicating that transferring language-driven geometric relationships to visual prototypes enhances anticipation and opens new avenues for semantically informed forecasting. The authors acknowledge limitations, such as the lack of explicit action-order modeling, and propose future work to address them.
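
To make the idea of "relative representations" concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a relative representation re-expresses each prototype as its cosine similarities to a set of anchors, and that the anchors are simply all prototypes of the same space; the summary above does not say how S-GEAR chooses them. The toy vectors and class names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def relative_representation(x, anchors):
    """Re-express x as its cosine similarity to each anchor vector."""
    return [cosine(x, a) for a in anchors]

# Hypothetical toy "language" prototypes, one per action class (made-up values).
language_protos = {
    "take knife": [1.0, 0.0, 0.0],
    "cut onion": [0.8, 0.6, 0.0],
    "wash hands": [0.0, 0.0, 1.0],
}

# Hypothetical "visual" prototypes: the same geometry, but with axes permuted
# and vectors rescaled, so raw coordinates are not comparable across spaces.
visual_protos = {
    name: [2.0 * z, 2.0 * x, 2.0 * y]
    for name, (x, y, z) in language_protos.items()
}

def to_relative(protos):
    """Use every prototype of the space as an anchor (an assumption of this sketch)."""
    anchors = list(protos.values())
    return {name: relative_representation(vec, anchors)
            for name, vec in protos.items()}

rel_language = to_relative(language_protos)
rel_visual = to_relative(visual_protos)

for name in language_protos:
    print(name,
          [round(s, 3) for s in rel_language[name]],
          [round(s, 3) for s in rel_visual[name]])
```

Because cosine similarity is unchanged by an axis permutation and by positive rescaling, both toy spaces produce the same relative coordinates, which is what allows prototypes from different encoders to be compared in a common space.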

Abstract

Action anticipation is the task of forecasting future activity from a partially observed sequence of events. However, this task is exposed to intrinsic future uncertainty and the difficulty of reasoning upon interconnected actions. Unlike previous works that focus on extrapolating better visual and temporal information, we concentrate on learning action representations that are aware of their semantic interconnectivity based on prototypical action patterns and contextual co-occurrences. To this end, we propose the novel Semantically Guided Representation Learning (S-GEAR) framework. S-GEAR learns visual action prototypes and leverages language models to structure their relationship, inducing semanticity. To gather insights on S-GEAR's effectiveness, we test it on four action anticipation benchmarks, obtaining improved results compared to previous works: +3.5, +2.7, and +3.5 absolute points on Top-1 Accuracy on Epic-Kitchens 55, EGTEA Gaze+, and 50 Salads, respectively, and +0.8 on Top-5 Recall on Epic-Kitchens 100. We further observe that S-GEAR effectively transfers the geometric associations between actions from language to visual prototypes. Finally, S-GEAR opens new research frontiers in anticipation tasks by demonstrating the intricate impact of action semantic interconnectivity.
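
The two metrics named above can be sketched as follows. This is an illustrative sketch rather than the benchmarks' official evaluation code: it assumes Top-k Accuracy is the fraction of samples whose true class is among the k highest-scoring classes, and that Top-k Recall is that same hit rate averaged per class, as is common for Epic-Kitchens 100. The scores and labels are hypothetical toy data.

```python
def topk_hit(scores, label, k):
    """True if `label` is among the k highest-scoring class indices."""
    ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    return label in ranked[:k]

def topk_accuracy(all_scores, labels, k):
    """Fraction of samples whose true class is in the top k predictions."""
    hits = [topk_hit(s, y, k) for s, y in zip(all_scores, labels)]
    return sum(hits) / len(hits)

def class_mean_topk_recall(all_scores, labels, k):
    """Top-k hit rate computed per class, then averaged over classes."""
    per_class = {}
    for s, y in zip(all_scores, labels):
        per_class.setdefault(y, []).append(topk_hit(s, y, k))
    recalls = [sum(h) / len(h) for h in per_class.values()]
    return sum(recalls) / len(recalls)

# Hypothetical toy predictions: 4 samples, 3 classes (made-up numbers).
scores = [
    [0.7, 0.2, 0.1],  # true class 0: top-1 hit
    [0.6, 0.3, 0.1],  # true class 0: top-1 hit
    [0.5, 0.4, 0.1],  # true class 1: top-1 miss, top-2 hit
    [0.2, 0.5, 0.3],  # true class 2: top-1 miss, top-2 hit
]
labels = [0, 0, 1, 2]

top1_acc = topk_accuracy(scores, labels, k=1)
top2_acc = topk_accuracy(scores, labels, k=2)
top1_recall = class_mean_topk_recall(scores, labels, k=1)
print(top1_acc, top2_acc, top1_recall)
```

In this toy data the frequent class 0 is always predicted correctly, so sample-level accuracy is higher than the class-averaged recall. The per-class average weights rare classes equally, which is why the two numbers are not interchangeable on long-tailed benchmarks.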

Paper Structure

This paper contains 26 sections, 15 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 1: We propose learning action prototypes that encode typical action representations and meaningful semantic interconnections. The model leverages these prototypes to enhance the network encodings of observed actions and to forecast upcoming ones.
  • Figure 2: S-GEAR processes frame sequence patches and creates input token sequences $S_t$. ViT $\phi$ encodes $S_t$ into intermediate features $I_t$. PA $\gamma$ and TCA $\varphi$ process $I_t$, merging outputs into semantically enhanced causal features $\hat{I}_t$. Class tokens pass through the CT decoder $\Omega$, predicting future features $z_{t}$. The features $z_t$ and the proposed prototypes are trained for action anticipation ($\mathcal{L}_{Cls}$) and semantic relation encodings ($\mathcal{L}_{Sem}$). The network is also regularized for accurate future representations ($\mathcal{L}_{Feat}$) and correct past action classification ($\mathcal{L}_{Past}$). Finally, a distance loss ($\mathcal{L}_{reg}$) is applied to $z_t$.
  • Figure 3: Ablation on language encoders.
  • Figure 4: Performance according to used prototype ratio.
  • Figure 5: EK55 (top) and EGTEA Gaze+ (bottom) Top-5 accuracy for variable $\tau_a$.
  • ...and 6 more figures
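
The loss terms listed in the Figure 2 caption are combined, per the TL;DR, as a weighted sum $\mathcal{L}_{tot} = \lambda_1 \mathcal{L}_{Sem} + \lambda_2 \mathcal{L}_{Cls} + \lambda_3 \mathcal{L}_{Past} + \lambda_4 \mathcal{L}_{Feat}$. A minimal sketch of that combination is shown here. The function name, the per-term loss values, and the $\lambda$ weights are all hypothetical and not taken from the paper. $\mathcal{L}_{reg}$ is left out because this page does not state how it is weighted.

```python
def total_loss(terms, weights):
    """Weighted sum of named loss terms: sum_i lambda_i * L_i."""
    missing = set(weights) - set(terms)
    if missing:
        raise KeyError(f"missing loss terms: {sorted(missing)}")
    return sum(weights[name] * terms[name] for name in weights)

# Hypothetical per-batch loss values (not results from the paper).
terms = {"Sem": 0.5, "Cls": 2.0, "Past": 1.0, "Feat": 0.25}

# Hypothetical lambda weights, chosen only to illustrate the arithmetic.
weights = {"Sem": 1.0, "Cls": 1.0, "Past": 0.5, "Feat": 2.0}

l_tot = total_loss(terms, weights)
print(l_tot)
```

Keeping the terms in a dictionary keyed by name makes it easy to ablate a term by setting its weight to zero without touching the rest of the training loop.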