MambaTAD: When State-Space Models Meet Long-Range Temporal Action Detection

Hui Lu, Yi Yu, Shijian Lu, Deepu Rajan, Boon Poh Ng, Alex C. Kot, Xudong Jiang

TL;DR

MambaTAD addresses long-range temporal action detection by combining Diagonal-Masked Bidirectional State-Space (DMBSS) modeling with a global feature fusion head, and adds a parameter-efficient State-Space Temporal Adapter (SSTA) for end-to-end training. The authors report state-of-the-art results across multiple benchmarks (THUMOS14, ActivityNet-1.3, MultiTHUMOS, HACS, FineAction) with fewer parameters and FLOPs than prior methods, along with robustness to long-span actions and occlusions. The key innovations are DMBSS, which mitigates temporal context decay and diagonal (self-element) conflicts; a multi-scale projection pyramid with global fusion, which captures cross-scale information; and SSTA, which enables efficient backbone adaptation in end-to-end TAD. Together, these components form a scalable, accurate, and efficient framework for end-to-end temporal action localization and classification.

Abstract

Temporal Action Detection (TAD) aims to identify and localize actions by determining their starting and ending frames within untrimmed videos. Recent structured state-space models such as Mamba have demonstrated potential in TAD due to their long-range modeling capability and linear computational complexity. However, structured state-space models often face two key challenges in TAD, namely decay of temporal context due to recursive processing and self-element conflict during global visual context modeling, both of which become more severe when handling long-span action instances. Additionally, traditional TAD methods struggle to detect long-span action instances due to a lack of global awareness and inefficient detection heads. This paper presents MambaTAD, a new state-space TAD model that introduces long-range modeling and global feature detection capabilities for accurate temporal action detection. MambaTAD comprises two novel, complementary designs. First, it introduces a Diagonal-Masked Bidirectional State-Space (DMBSS) module, which facilitates global feature fusion and temporal action detection. Second, it introduces a global feature fusion head that progressively refines detections using multi-granularity features and global awareness. In addition, MambaTAD tackles TAD in an end-to-end, one-stage manner using a new state-space temporal adapter (SSTA), which reduces network parameters and computation cost with linear complexity. Extensive experiments show that MambaTAD consistently achieves superior TAD performance across multiple public benchmarks.

Paper Structure

This paper contains 26 sections, 23 equations, 10 figures, 10 tables, 1 algorithm.

Figures (10)

  • Figure 1: Comparison of TAD methods. Prior methods suffer from decay of temporal information and self-element conflict, and often struggle with long-span action instances. The proposed MambaTAD handles long-span action instances effectively with its long-range modeling and global feature fusion capabilities. Coverage and Length are two metrics for identifying long action instances: Coverage is the proportion of the whole video that an action occupies ($[0.08, 1]$), and Length is the absolute action length ($[18, \infty)$ seconds). Average denotes the normalized average mAP over all action instances of various lengths in the dataset.
  • Figure 2: The overall MambaTAD architecture uses large-scale pre-trained models for the backbone, with a State-Space Temporal Adapter (SSTA) in the end-to-end setting. Pyramid features are processed by Diagonal-Masked Bidirectional State-Space (DMBSS) modules, followed by a global fusion head that progressively concatenates features for global context.
  • Figure 3: (a) The architecture of DMBSS. (b) Forward and backward branches share the same architecture. (c) We mask the diagonal elements in the state transformation matrix $A$ to resolve the self-element conflict.
  • Figure 4: Visualized detection results on THUMOS14 for the actions (a) Clean and Jerk; (b) Hammer Throw (occlusion case, best viewed zoomed in); (c) Throw Discus; (d) High Jump.
  • Figure 5: Additional visualizations of our results. From top to bottom, each item includes: (1) the input video frames, (2) action scores over time, and (3) a histogram of action onsets and offsets, derived by weighting the regression outputs with the corresponding action scores. The square-dotted line marks the ground-truth start position, while the dash-dotted line marks the ground-truth end position. This figure is best viewed in color and zoomed in.
  • ...and 5 more figures
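
The diagonal masking mentioned in the Figure 3 caption can be illustrated with a toy example. The sketch below is not the authors' implementation. It assumes a scalar, per-step linear recurrence $h_t = a_t h_{t-1} + b_t x_t$, $y_t = c_t h_t$, unrolls it into a token-mixing matrix, and sums a forward and a backward scan. Under that assumption each token's contribution to itself is counted twice, which is one possible reading of the "self-element conflict". Zeroing the diagonal of the backward branch is likewise an assumption made for illustration, and the names `causal_mixing_matrix` and `bidirectional_mixing` are hypothetical.

```python
import numpy as np


def causal_mixing_matrix(a, b, c):
    """Unroll h_t = a_t*h_{t-1} + b_t*x_t, y_t = c_t*h_t into a matrix M with y = M @ x.

    M[t, s] = c_t * (a_{s+1} * ... * a_t) * b_s for s <= t, and 0 otherwise.
    """
    T = len(a)
    M = np.zeros((T, T))
    for t in range(T):
        decay = 1.0  # running product of a_k between source step s and target step t
        for s in range(t, -1, -1):
            M[t, s] = c[t] * decay * b[s]
            decay *= a[s]
    return M


def bidirectional_mixing(a, b, c, mask_diagonal=True):
    """Sum of a forward scan and a backward scan (illustrative assumption).

    The backward scan runs the same recurrence on the reversed sequence, so its
    unrolled matrix is the forward construction flipped along both axes.
    """
    a, b, c = (np.asarray(v, dtype=float) for v in (a, b, c))
    fwd = causal_mixing_matrix(a, b, c)
    bwd = causal_mixing_matrix(a[::-1], b[::-1], c[::-1])[::-1, ::-1]
    if mask_diagonal:
        # Assumed form of the masking: drop the backward branch's self terms
        # so each token contributes to itself only once.
        bwd = bwd - np.diag(np.diag(bwd))
    return fwd + bwd


if __name__ == "__main__":
    T = 4
    a = np.full(T, 0.5)  # constant decay factor, chosen only for the demo
    b = np.ones(T)
    c = np.ones(T)
    print(np.diag(bidirectional_mixing(a, b, c, mask_diagonal=False)))  # self terms counted twice
    print(np.diag(bidirectional_mixing(a, b, c, mask_diagonal=True)))   # self terms counted once
```

In this toy setting the off-diagonal entries shrink geometrically with the distance between two time steps, which also gives a concrete picture of the temporal context decay described in the abstract. A real selective state-space layer uses input-dependent, multi-dimensional parameters and a hardware-efficient scan rather than an explicit $T \times T$ matrix.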