HAT: History-Augmented Anchor Transformer for Online Temporal Action Localization
Sakib Reza, Yuexi Zhang, Mohsen Moghaddam, Octavia Camps
TL;DR
Online Temporal Action Localization (OnTAL) requires real-time instance-level predictions from streaming videos, but existing methods underutilize long-term historical context. The History-Augmented Anchor Transformer (HAT) integrates a History module (compressor, action anticipation head, future-driven refinement) with a History-Augmented Anchor Module (encoder/decoder with anchors) to produce refined anchor features for online prediction, aided by an Adaptive Focal Loss to handle class imbalance. Key contributions include the first long-term history–aware anchor-based transformer for OnTAL, a history-compression mechanism guided by action anticipation, a future-context alignment step, and comprehensive evaluations showing strong gains on procedural egocentric datasets (PREGO) with competitive performance on standard OnTAL benchmarks. Overall, the work demonstrates the practical impact of long-term history for online action localization, particularly in procedural and egocentric settings, and points to future work on dynamic history lengths and adaptive attention.
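As background for the Adaptive Focal Loss mentioned above, the following is a minimal sketch of the standard (non-adaptive) focal loss for a single sample. It is not the authors' implementation: the adaptive variant used in HAT is not specified in this summary, and the function name `focal_loss` and the parameters `gamma` and `alpha` are illustrative assumptions.

```python
import math


def focal_loss(probs, target, gamma=2.0, alpha=None):
    """Standard focal loss for one sample (illustrative, not HAT's adaptive variant).

    probs  : list of predicted class probabilities (assumed to sum to 1)
    target : index of the ground-truth class
    gamma  : focusing parameter; gamma = 0 reduces to cross-entropy
    alpha  : optional list of per-class weights (hypothetical parameter)
    """
    # Clamp the true-class probability to avoid log(0).
    p_t = min(max(probs[target], 1e-12), 1.0)
    weight = 1.0 if alpha is None else alpha[target]
    # The (1 - p_t)^gamma factor down-weights well-classified examples.
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t)


# With gamma = 0 the loss equals plain cross-entropy.
ce = focal_loss([0.9, 0.1], target=0, gamma=0.0)
# With gamma = 2 the same easy example contributes far less.
fl = focal_loss([0.9, 0.1], target=0, gamma=2.0)
```

The modulating factor shrinks the contribution of confidently classified samples (often the dominant background class), which is why focal-style losses are a common choice under class imbalance.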
Abstract
Online video understanding often relies on individual frames, leading to frame-by-frame predictions. Recent advancements, such as Online Temporal Action Localization (OnTAL), extend this approach to instance-level predictions. However, existing methods mainly focus on short-term context, neglecting historical information. To address this, we introduce the History-Augmented Anchor Transformer (HAT) Framework for OnTAL. By integrating historical context, our framework enhances the synergy between long-term and short-term information, improving the quality of anchor features crucial for classification and localization. We evaluate our model on both procedural egocentric (PREGO) datasets (EGTEA and EPIC) and standard non-PREGO OnTAL datasets (THUMOS and MUSES). Results show that our model significantly outperforms state-of-the-art approaches on PREGO datasets and achieves comparable or slightly superior performance on non-PREGO datasets, underscoring the importance of leveraging long-term history, especially in procedural and egocentric action scenarios. Code is available at: https://github.com/sakibreza/ECCV24-HAT/
