Context-Enhanced Memory-Refined Transformer for Online Action Detection

Zhanzhong Pang, Fadime Sener, Angela Yao

TL;DR

The paper addresses the training-inference discrepancy in memory-based online action detection by diagnosing imbalanced short-term context exposure and non-causal leakage from pseudo-future anticipation. It proposes CMeRT, a transformer-based architecture whose Context-Enhanced Encoder and Memory-Refined Decoder incorporate near-past and near-future context to stabilize frame representations for both detection and anticipation. Empirical results on THUMOS'14, CrossTask, and EK100 with strong DINOv2 features establish state-of-the-art performance, supported by extensive ablations and runtime analyses. The work highlights the importance of consistent context modeling in OAD and provides new benchmarks and protocols to accelerate progress in the field.

Abstract

Online Action Detection (OAD) detects actions in streaming videos using past observations. State-of-the-art OAD approaches model past observations and their interactions with an anticipated future. The past is encoded using short- and long-term memories to capture immediate and long-range dependencies, while anticipation compensates for missing future context. We identify a training-inference discrepancy in existing OAD methods that hinders learning effectiveness: training uses varying lengths of short-term memory, while inference relies on a full-length short-term memory. As a remedy, we propose a Context-enhanced Memory-Refined Transformer (CMeRT). CMeRT introduces a context-enhanced encoder to improve frame representations using additional near-past context. It also features a memory-refined decoder that leverages near-future generation to enhance performance. CMeRT achieves state-of-the-art performance in online detection and anticipation on THUMOS'14, CrossTask, and EPIC-Kitchens-100.
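
The discrepancy described in the abstract can be illustrated with a small counting sketch. It assumes, as one plausible reading rather than a detail confirmed by the text here, that training supervises every frame of the short-term window under a causal mask, while inference classifies only the latest frame. The function names and the window length are hypothetical.

```python
def visible_short_term_context(short_len: int) -> list[int]:
    """Short-term frames each position can attend to under a causal mask.

    Assumption: position i sees frames 0..i of the window (itself included).
    """
    return [i + 1 for i in range(short_len)]


def exposure_summary(short_len: int) -> dict:
    """Compare short-term context seen per frame in training vs. inference."""
    train = visible_short_term_context(short_len)
    return {
        "train_min": min(train),                # earliest frame sees only itself
        "train_max": max(train),                # latest frame sees the whole window
        "train_mean": sum(train) / len(train),  # average exposure across supervised frames
        "inference": short_len,                 # only the latest frame is classified
    }


if __name__ == "__main__":
    print(exposure_summary(8))
```

Under these assumptions, most supervised frames in training see far less short-term context than the single frame evaluated at inference, which is the imbalance the near-past context is meant to offset.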

Paper Structure

This paper contains 25 sections, 9 equations, 13 figures, 14 tables.

Figures (13)

  • Figure 1: Existing methods exhibit poorly learned frame representations due to imbalanced context exposure and non-causal leakage.
  • Figure 2: Context analysis for short-term frames. $\overline{\ \ast \ }$ for indirectly encoded, $\widehat{\ \ast \ }$ for generated, and the rest for direct context.
  • Figure 3: Frame losses within the short-term memory in MAT [wang2023memory] at different rounds of accessing anticipated future.
  • Figure 4: Framework of the Context-Enhanced Memory-Refined Transformer. The model follows an encoder-decoder formulation and operates on five context partitions: long-term, short-term, anticipation, near-past, and near-future. The Context-Enhanced Encoder compresses the long-term memory $M_L$ and encodes the short-term memory with anticipation as $M_{SA}$ using the compressed long-term $\widehat{M_L}$ and near-past context $M_C$. The Memory-Refined Decoder generates the near-future context $M_F$ from $\widehat{M_L}$ and refines $M_{SA}$ using $M_F$. A weight-shared classifier is adopted to classify the short-term $M_{SA}$ and $\widehat{M_{SA}}$ as well as the near-future $M_F$. All modules are built upon the Transformer Decoder Unit.
  • Figure 5: Qualitative results on THUMOS'14.
  • ...and 8 more figures
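
The five context partitions named in the Figure 4 caption can be sketched as index ranges around the current frame. This is only an illustrative layout, not the authors' implementation: it assumes the near-past context sits immediately before the short-term window (overlapping the tail of the long-term memory) and that the anticipation and near-future positions both start right after the current frame. `ContextPartitions`, `partition_context`, and all length parameters are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ContextPartitions:
    long_term: range     # observed frames feeding M_L
    near_past: range     # observed frames feeding M_C (assumed placement)
    short_term: range    # observed frames, the observed part of M_SA
    anticipation: range  # future positions, generated rather than observed
    near_future: range   # future positions for M_F, generated rather than observed


def partition_context(t: int, long_len: int, short_len: int,
                      near_past_len: int, anticip_len: int,
                      near_future_len: int) -> ContextPartitions:
    """Split frame indices around current frame t into five partitions."""
    short_start = max(0, t - short_len + 1)
    return ContextPartitions(
        # Long-term memory: everything up to long_len frames before the short-term window.
        long_term=range(max(0, short_start - long_len), short_start),
        # Near-past context: the frames just before the short-term window (assumption).
        near_past=range(max(0, short_start - near_past_len), short_start),
        # Short-term memory: the most recent short_len frames, ending at t.
        short_term=range(short_start, t + 1),
        # Future positions carry no observations; they are only placeholders here.
        anticipation=range(t + 1, t + 1 + anticip_len),
        near_future=range(t + 1, t + 1 + near_future_len),
    )


if __name__ == "__main__":
    print(partition_context(t=100, long_len=64, short_len=8,
                            near_past_len=4, anticip_len=2, near_future_len=3))
```

Only the long-term, near-past, and short-term ranges index observed frames; the two future ranges mark positions a model would have to generate, which keeps the sketch causal.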