Context-Enhanced Memory-Refined Transformer for Online Action Detection
Zhanzhong Pang, Fadime Sener, Angela Yao
TL;DR
The paper addresses the training-inference discrepancy in memory-based online action detection (OAD), which it traces to imbalanced short-term context exposure and non-causal leakage from pseudo-future anticipation. It proposes CMeRT, a transformer-based architecture with a Context-Enhanced Encoder and a Memory-Refined Decoder that incorporate near-past and near-future context to stabilize frame representations for both detection and anticipation. Empirical results on THUMOS'14, CrossTask, and EPIC-Kitchens-100 (EK100) with strong DINOv2 features establish state-of-the-art performance, supported by extensive ablations and runtime analyses. The work highlights the importance of consistent context modeling in OAD and provides new benchmarks and protocols to accelerate progress in the field.
Abstract
Online Action Detection (OAD) detects actions in streaming videos using past observations. State-of-the-art OAD approaches model past observations and their interactions with an anticipated future. The past is encoded using short- and long-term memories to capture immediate and long-range dependencies, while anticipation compensates for missing future context. We identify a training-inference discrepancy in existing OAD methods that hinders learning effectiveness: training uses varying lengths of short-term memory, while inference relies on a full-length short-term memory. As a remedy, we propose a Context-Enhanced Memory-Refined Transformer (CMeRT). CMeRT introduces a context-enhanced encoder to improve frame representations using additional near-past context. It also features a memory-refined decoder to leverage near-future generation to enhance performance. CMeRT achieves state-of-the-art performance in online detection and anticipation on THUMOS'14, CrossTask, and EPIC-Kitchens-100.
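
The discrepancy described in the abstract can be illustrated with a small counting sketch. It is not the authors' implementation. It assumes that every frame in the short-term window is supervised during training under a causal attention mask, so the frame at position i sees only i + 1 short-term frames, whereas inference uses only the last position, which sees the full window. The function name `short_term_context_sizes`, the window length of 8, and the near-past length of 4 are hypothetical, and modeling the near-past context as extra frames visible to every position is a simplification made here for illustration.

```python
def short_term_context_sizes(window_len, near_past_len=0):
    """Count the frames each short-term position can attend to.

    Assumes a causal mask: position i sees positions 0..i of the window,
    plus `near_past_len` extra frames that precede the window
    (a simplified stand-in for additional near-past context).
    """
    if window_len <= 0 or near_past_len < 0:
        raise ValueError("window_len must be positive, near_past_len non-negative")
    return [near_past_len + i + 1 for i in range(window_len)]


# Training: all positions are supervised, so context length varies.
train_sizes = short_term_context_sizes(8)

# Inference: only the newest frame is predicted, with the full window visible.
inference_size = train_sizes[-1]

# With extra near-past frames, the shortest context is no longer a single frame.
enhanced_sizes = short_term_context_sizes(8, near_past_len=4)

print("training context sizes:", train_sizes)
print("inference context size:", inference_size)
print("with near-past context:", enhanced_sizes)
```

Under these assumptions, the earliest training position sees one frame while inference always sees eight. Prepending near-past frames raises the minimum context and narrows the relative gap between the shortest and longest context seen in training.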
