HAT: History-Augmented Anchor Transformer for Online Temporal Action Localization

Sakib Reza, Yuexi Zhang, Mohsen Moghaddam, Octavia Camps

TL;DR

Online Temporal Action Localization (OnTAL) requires real-time instance-level predictions from streaming videos, but existing methods underutilize long-term historical context. The History-Augmented Anchor Transformer (HAT) integrates a History module (compressor, action anticipation head, future-driven refinement) with a History-Augmented Anchor Module (encoder/decoder with anchors) to produce refined anchor features for online prediction, aided by an Adaptive Focal Loss to handle class imbalance. Key contributions include the first long-term history–aware anchor-based transformer for OnTAL, a history-compression mechanism guided by action anticipation, a future-context alignment step, and comprehensive evaluations showing strong gains on procedural egocentric datasets (PREGO) with competitive performance on standard OnTAL benchmarks. Overall, the work demonstrates the practical impact of long-term history for online action localization, particularly in procedural and egocentric settings, and points to future work on dynamic history lengths and adaptive attention.
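The summary above names an Adaptive Focal Loss for class imbalance but does not give its formula. As a point of reference only, the sketch below shows the standard (non-adaptive) focal loss, FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t), on which such variants are typically built. The function name `focal_loss`, the default `gamma` and `alpha` values, and the sample probabilities are illustrative assumptions, not taken from the paper; how HAT adapts the loss is not reproduced here.

```python
import math

def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Standard focal loss for a single sample (NOT the paper's adaptive variant).

    probs  -- list of predicted class probabilities (assumed to sum to 1)
    target -- index of the ground-truth class
    gamma  -- focusing parameter; gamma = 0 recovers plain cross-entropy
    alpha  -- constant weighting factor (hypothetical default)
    """
    # Clamp the true-class probability to avoid log(0).
    p_t = min(max(probs[target], 1e-12), 1.0)
    # The (1 - p_t)^gamma factor down-weights well-classified samples.
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

# A confidently correct sample versus a poorly classified one.
easy = focal_loss([0.9, 0.05, 0.05], target=0)
hard = focal_loss([0.2, 0.5, 0.3], target=0)

# With gamma = 0 the loss reduces to cross-entropy, -log(p_t).
ce_easy = focal_loss([0.9, 0.05, 0.05], target=0, gamma=0.0)
```

With `gamma > 0`, the well-classified sample contributes far less loss than it would under cross-entropy, so training gradients are dominated by hard samples, which in an imbalanced setting are often those of the rarer classes.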

Abstract

Online video understanding often relies on individual frames, leading to frame-by-frame predictions. Recent advancements, such as Online Temporal Action Localization (OnTAL), extend this approach to instance-level predictions. However, existing methods mainly focus on short-term context, neglecting historical information. To address this, we introduce the History-Augmented Anchor Transformer (HAT) framework for OnTAL. By integrating historical context, our framework enhances the synergy between long-term and short-term information, improving the quality of the anchor features that are crucial for classification and localization. We evaluate our model on both procedural egocentric (PREGO) datasets (EGTEA and EPIC) and standard non-PREGO OnTAL datasets (THUMOS and MUSES). Results show that our model significantly outperforms state-of-the-art approaches on PREGO datasets and achieves comparable or slightly superior performance on non-PREGO datasets, underscoring the importance of leveraging long-term history, especially in procedural and egocentric action scenarios. Code is available at: https://github.com/sakibreza/ECCV24-HAT/

Paper Structure (41 sections, 8 equations, 5 figures, 8 tables)

Figures (5)

  • Figure 1: High-level architecture designs of prior approaches and our method.
  • Figure 2: Proposed History-Augmented Anchor Transformer (HAT) architecture.
  • Figure 3: Comparison of qualitative results of HAT with OAT (Kim et al., 2022). Blue is the ground truth, red is the result of OAT, and green is the result of HAT.
  • Figure 4: Frame attention analysis for a "cutting" a tomato example from the EGTEA dataset. The frames with related actions, such as "taking" the tomato from the freezer, "taking" a knife, "putting" the tomato on a plate, and the initial "cutting" action, receive more attention (upper row).
  • Figure 5: Comparison of the impact of history length.