MS-Temba: Multi-Scale Temporal Mamba for Understanding Long Untrimmed Videos
Arkaprava Sinha, Monish Soundar Raj, Pu Wang, Ahmed Helmy, Hieu Le, Srijan Das
TL;DR
MS-Temba introduces a multi-scale temporal Mamba framework that extends state-space models with dilated branches to capture both fine-grained and long-range temporal dynamics in densely labeled untrimmed videos. The architecture combines dilated Temba blocks with a lightweight MS-Fuser and scale-aware auxiliary supervision, enabling precise action boundary localization with only 17M parameters. Empirical results on TSU and Charades show state-of-the-art performance, while ablations validate the importance of multi-scale dilation, projection alignment, and fusion. The approach also transfers to video summarization, achieving top performance on TVSum and SumMe, highlighting its versatility for long-form video understanding.
Abstract
Temporal Action Detection (TAD) in untrimmed videos poses significant challenges, particularly for Activities of Daily Living (ADL), requiring models to (1) process long-duration videos, (2) capture temporal variations in actions, and (3) simultaneously detect dense overlapping actions. Existing CNN- and Transformer-based approaches struggle to jointly capture fine-grained detail and long-range structure at scale. Mamba, which is based on State-Space Models (SSMs), offers powerful long-range modeling, but naive application to TAD collapses fine-grained temporal structure and fails to account for the challenges inherent to TAD. To this end, we propose Multi-Scale Temporal Mamba (MS-Temba), which extends Mamba to TAD with newly introduced dilated SSMs. Each Temba block, comprising dilated SSMs coupled with our proposed additional losses, enables the learning of discriminative representations across temporal scales. A lightweight Multi-scale Mamba Fuser then unifies these multi-scale features via SSM-based aggregation, yielding precise action-boundary localization. With only 17M parameters, MS-Temba achieves state-of-the-art performance on the densely labeled ADL benchmarks TSU and Charades, and further generalizes to long-form video summarization, setting new state-of-the-art results on TVSum and SumMe.
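To make the idea of a dilated state-space scan concrete, the following is a minimal sketch and not the authors' implementation. It assumes one plausible reading of "dilated SSM": for a dilation d, the frame sequence is split into d interleaved sub-sequences, a recurrence is run over each, and the outputs are written back to their original frame positions. The scalar recurrence `linear_scan`, its parameters `a` and `b`, and the helper names `dilated_scan` and `multi_scale_features` are hypothetical stand-ins chosen for illustration; a real Mamba layer uses learned, input-dependent parameters over feature vectors.

```python
def linear_scan(xs, a=0.9, b=0.1):
    """Toy scalar state-space recurrence: h_t = a * h_{t-1} + b * x_t, y_t = h_t.

    Illustrative stand-in for an SSM layer; not a Mamba implementation.
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(h)
    return ys


def dilated_scan(xs, dilation, a=0.9, b=0.1):
    """Run the toy recurrence over every `dilation`-th frame (assumed scheme).

    Frames at indices offset, offset + dilation, offset + 2 * dilation, ...
    form one sub-sequence. Each sub-sequence is scanned independently and its
    outputs are scattered back to the original frame positions.
    """
    out = [0.0] * len(xs)
    for offset in range(dilation):
        idx = list(range(offset, len(xs), dilation))
        ys = linear_scan([xs[i] for i in idx], a, b)
        for i, y in zip(idx, ys):
            out[i] = y
    return out


def multi_scale_features(xs, dilations=(1, 2, 4)):
    """Collect one output sequence per dilation rate (hypothetical helper)."""
    return {d: dilated_scan(xs, d) for d in dilations}


if __name__ == "__main__":
    # A single impulse at frame 0, followed by silence.
    frames = [1.0, 0.0, 0.0, 0.0]
    for d, feats in multi_scale_features(frames, dilations=(1, 2)).items():
        print(d, feats)
```

In this toy setting, a larger dilation lets the state span more frames per recurrence step, so distant frames are connected with less decay, while dilation 1 preserves frame-by-frame detail. The fusion of the per-scale outputs (the role the abstract assigns to the Multi-scale Mamba Fuser) and the auxiliary losses are deliberately left out of this sketch.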
