MS-Temba: Multi-Scale Temporal Mamba for Understanding Long Untrimmed Videos

Arkaprava Sinha, Monish Soundar Raj, Pu Wang, Ahmed Helmy, Hieu Le, Srijan Das

TL;DR

MS-Temba introduces a multi-scale temporal Mamba framework that extends state-space models with dilated branches to capture both fine-grained and long-range temporal dynamics in densely labeled untrimmed videos. The architecture combines dilated Temba blocks with a lightweight MS-Fuser and scale-aware auxiliary supervision, enabling precise action boundary localization with only 17M parameters. Empirical results on TSU and Charades show state-of-the-art performance, while ablations validate the importance of multi-scale dilation, projection alignment, and fusion. The approach also transfers to video summarization, achieving top performance on TVSum and SumMe, highlighting its versatility for long-form video understanding.

Abstract

Temporal Action Detection (TAD) in untrimmed videos poses significant challenges, particularly for Activities of Daily Living (ADL), which require models to (1) process long-duration videos, (2) capture temporal variations in actions, and (3) simultaneously detect dense overlapping actions. Existing CNN and Transformer-based approaches struggle to jointly capture fine-grained detail and long-range structure at scale. Mamba, which is based on State-space Models (SSMs), offers powerful long-range modeling, but naive application to TAD collapses fine-grained temporal structure and fails to account for the challenges inherent to TAD. To this end, we propose Multi-Scale Temporal Mamba (MS-Temba), which extends Mamba to TAD with newly introduced dilated SSMs. Each Temba block, comprising dilated SSMs coupled with our proposed additional losses, enables the learning of discriminative representations across temporal scales. A lightweight Multi-scale Mamba Fuser then unifies these multi-scale features via SSM-based aggregation, yielding precise action-boundary localization. With only 17M parameters, MS-Temba achieves state-of-the-art performance on the densely labeled ADL benchmarks TSU & Charades, and further generalizes to long-form video summarization, setting new state-of-the-art results on TVSum & SumMe.

Paper Structure

This paper contains 20 sections, 16 equations, 11 figures, 11 tables.

Figures (11)

  • Figure 1: Left: Temporal Action Detection poses unique challenges, including the need for models capable of processing long video sequences, the presence of both short and long actions with high intra-class variance, and the complexity of densely overlapping actions. Right: Our proposed method, MS-Temba, achieves state-of-the-art performance while being $\mathbf{5\times}$ more parameter-efficient compared to transformer-based approaches.
  • Figure 2: Comparison between standard and dilated SSM scanning
  • Figure 3: Overall Architecture of Multiscale Temporal Mamba (MS-Temba) for action detection. MS-Temba is composed of a frozen pretrained Visual Backbone, Temporal Mamba (Temba) Blocks that learn representations at multiple temporal scales through dilated SSMs, and a Multi-scale Mamba Fuser. The Multi-scale Mamba Fuser employs an SSM to fuse the multi-scale features, which are then projected by the Classification Head for dense action detection.
  • Figure 4: Temba Block 3 with $\eta=3$. Tokens are linearly projected, grouped with stride $\eta$ using $\Phi$, processed by SSMs, aligned with $\mathcal{L}_{\text{cons}}$, and reassembled by $\Phi^{-1}$. A classification head then produces the auxiliary loss $\mathcal{L}_{\text{aux}}$.
  • Figure 5: Impact of dilation in Temba for short and long actions
  • ...and 6 more figures
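
The grouping step described in the Figure 4 caption (tokens grouped with stride $\eta$ by $\Phi$, processed per group, then reassembled by $\Phi^{-1}$) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `phi`, `phi_inverse`, `toy_scan`, and `dilated_scan` are hypothetical, the simple decaying recurrence is only a stand-in for a Mamba SSM, and the linear projection, $\mathcal{L}_{\text{cons}}$, and $\mathcal{L}_{\text{aux}}$ are left out.

```python
import numpy as np

def phi(tokens, eta):
    """Hypothetical Phi: split a (T, D) sequence into eta strided subsequences."""
    return [tokens[offset::eta] for offset in range(eta)]

def phi_inverse(groups, eta, length):
    """Hypothetical Phi^{-1}: interleave the subsequences back into temporal order."""
    out = np.empty((length,) + groups[0].shape[1:], dtype=groups[0].dtype)
    for offset, group in enumerate(groups):
        out[offset::eta] = group
    return out

def toy_scan(x, decay=0.9):
    """Stand-in for an SSM (assumption): h_t = decay * h_{t-1} + x_t."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    if len(x) == 0:
        return out
    h = np.zeros_like(x[0])
    for t in range(len(x)):
        h = decay * h + x[t]
        out[t] = h
    return out

def dilated_scan(tokens, eta, decay=0.9):
    """Scan each strided group independently, then restore the original order."""
    tokens = np.asarray(tokens, dtype=float)
    scanned = [toy_scan(group, decay) for group in phi(tokens, eta)]
    return phi_inverse(scanned, eta, len(tokens))

# Usage: 7 one-dimensional tokens with dilation 3.
# Groups are frames (0, 3, 6), (1, 4), and (2, 5).
x = np.arange(7, dtype=float).reshape(7, 1)
y = dilated_scan(x, eta=3, decay=1.0)
```

In this sketch each token only accumulates state from tokens that are $\eta$ steps apart, so a larger $\eta$ covers a longer temporal span with the same number of recurrence steps per group. Setting `eta=1` reduces it to an ordinary sequential scan.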