Harnessing Temporal Causality for Advanced Temporal Action Detection

Shuming Liu, Lin Sui, Chen-Lin Zhang, Fangzhou Mu, Chen Zhao, Bernard Ghanem

TL;DR

The paper rethinks temporal action detection by treating changes in action boundaries as causal events and restricting the model's context to either the past or the future. It introduces CausalTAD, which places a hybrid causal block, fusing causal attention with causal Mamba (SSM), inside a one-stage detection framework derived from ActionFormer to model long-range temporal dependencies and causality. The approach achieves state-of-the-art results on Ego4D Moment Queries, EPIC-Kitchens 100, ActivityNet-1.3, and THUMOS14, and ranked 1st in the Moment Queries track of the Ego4D Challenge 2024 and in the Action Recognition, Action Detection, and Audio-Based Interaction Detection tracks of the EPIC-Kitchens Challenge 2024. Ablations indicate benefits from full-sequence causal context and from larger data scale. The work emphasizes offline feature strategies and releases open-source code, targeting long-form video understanding in both egocentric and third-person domains.

Abstract

As a fundamental task in long-form video understanding, temporal action detection (TAD) aims to capture inherent temporal relations in untrimmed videos and identify candidate actions with precise boundaries. Over the years, various networks, including convolutions, graphs, and transformers, have been explored for effective temporal modeling for TAD. However, these modules typically treat past and future information equally, overlooking the crucial fact that changes in action boundaries are essentially causal events. Inspired by this insight, we propose leveraging the temporal causality of actions to enhance TAD representation by restricting the model's access to only past or future context. We introduce CausalTAD, which combines causal attention and causal Mamba to achieve state-of-the-art performance on multiple benchmarks. Notably, with CausalTAD, we ranked 1st in the Action Recognition, Action Detection, and Audio-Based Interaction Detection tracks at the EPIC-Kitchens Challenge 2024, as well as 1st in the Moment Queries track at the Ego4D Challenge 2024. Our code is available at https://github.com/sming256/OpenTAD/.

Paper Structure (30 sections, 2 equations, 2 figures, 10 tables)

Figures (2)

  • Figure 1: (a) Standard temporal modeling, such as convolutions, graphs, and self-attention, treats past and future context equally, overlooking the fact that changes in action boundaries are essentially causal events. (b) and (c) mitigate this issue by restricting the model's access to only past or only future context, respectively.
  • Figure 2: Hybrid Causal Block. We combine Multi-Head Self-Attention (MHSA) with a Mamba block (SSM) and limit their visible temporal context to only past or only future tokens, aiming to capture long-range temporal dependencies and causality. The forward and backward branches share the MHSA and SSM parameters to mitigate overfitting in TAD.
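
The idea in Figure 2 can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the single-head attention without learned projections, the fixed-decay linear scan standing in for Mamba, and the summation used to fuse the branches are all assumptions made for illustration. What it does show is how a past-only and a future-only view of the same sequence can be produced by one set of parameters.

```python
import math


def causal_attention(x, direction="past"):
    """Toy single-head dot-product attention with a causal mask.

    direction="past":   token t attends to tokens 0..t
    direction="future": token t attends to tokens t..T-1
    Assumption: no learned projections, queries = keys = values = x.
    """
    T, d = len(x), len(x[0])
    out = []
    for t in range(T):
        visible = range(0, t + 1) if direction == "past" else range(t, T)
        scores = [
            sum(a * b for a, b in zip(x[t], x[s])) / math.sqrt(d)
            for s in visible
        ]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w / total * x[s][k] for w, s in zip(weights, visible))
            for k in range(d)
        ])
    return out


def causal_scan(x, decay=0.5, direction="past"):
    """Toy linear recurrence h_t = decay * h_prev + (1 - decay) * x_t.

    Assumption: a fixed-decay scan used as a stand-in for a Mamba/SSM layer;
    it is not selective and has no learned parameters.
    """
    T, d = len(x), len(x[0])
    order = range(T) if direction == "past" else range(T - 1, -1, -1)
    h = [0.0] * d
    out = [None] * T
    for t in order:
        h = [decay * hk + (1 - decay) * xk for hk, xk in zip(h, x[t])]
        out[t] = h
    return out


def hybrid_causal_block(x, decay=0.5):
    """Run attention and scan in both directions with the same settings.

    Reusing one `decay` (and one attention function) for both directions
    mirrors the parameter sharing described in Figure 2.
    Assumption: branches are fused by summation plus a residual connection.
    """
    T, d = len(x), len(x[0])
    branches = []
    for direction in ("past", "future"):
        branches.append(causal_attention(x, direction))
        branches.append(causal_scan(x, decay, direction))
    return [
        [x[t][k] + sum(b[t][k] for b in branches) for k in range(d)]
        for t in range(T)
    ]


# Usage: three time steps with two feature channels each.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = hybrid_causal_block(features)
```

In the past-only branches, editing a later token leaves the outputs at earlier positions unchanged, and the future-only branches behave symmetrically. That is the restriction Figure 1 (b) and (c) describe.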