Enhancing Temporal Action Localization: Advanced S6 Modeling with Recurrent Mechanism

Sangyoun Lee, Juho Jung, Changdae Oh, Sunghee Yun

TL;DR

This work targets Temporal Action Localization (TAL) by addressing the limitations of conventional sequence models in capturing long-range temporal dependencies and causality. It introduces a TAL framework based on the Selective State Space Model (S6) that combines the Feature Aggregated Bi-S6 (FA-Bi-S6) block, the Dual Bi-S6 structure, and a recurrent mechanism, modeling multi-scale spatiotemporal dependencies without increasing the parameter count. The approach reports state-of-the-art performance on THUMOS-14, ActivityNet, FineAction, and HACS, supported by ablations that validate the Stem module design and the recurrence strategy. The findings highlight the potential of S6-based architectures to improve TAL by integrating temporal causality and multi-scale context, and motivate further exploration of state-space models in video understanding.

Abstract

Temporal Action Localization (TAL) is a critical task in video analysis that identifies the precise start and end times of actions. Existing methods based on CNNs, RNNs, GCNs, and Transformers are limited in capturing long-range dependencies and temporal causality. To address these challenges, we propose a novel TAL architecture leveraging the Selective State Space Model (S6). Our approach integrates the Feature Aggregated Bi-S6 block, the Dual Bi-S6 structure, and a recurrent mechanism to enhance temporal and channel-wise dependency modeling without increasing parameter complexity. Extensive experiments on benchmark datasets demonstrate state-of-the-art results, with mAP scores of 74.2% on THUMOS-14, 42.9% on ActivityNet, 29.6% on FineAction, and 45.8% on HACS. Ablation studies validate our method's effectiveness, showing that the Dual structure in the Stem module and the recurrent mechanism outperform traditional approaches. Our findings demonstrate the potential of S6-based models in TAL tasks, paving the way for future research.
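
The two ideas named in the abstract, bidirectional scanning and a recurrent mechanism that reuses the same network, can be illustrated with a toy sketch. This is not the authors' implementation: a fixed-coefficient linear recurrence stands in for the selective (input-dependent) S6 update, and the names `scan`, `bi_scan`, `recurrent_apply`, `decay`, and `gain` are hypothetical. It only shows how repeated passes through one shared block can widen temporal context while the number of parameters stays the same.

```python
from typing import Dict, List


def scan(xs: List[float], decay: float, gain: float) -> List[float]:
    """Causal linear recurrence h_t = decay * h_{t-1} + gain * x_t.

    Assumption: a stand-in for an S6 scan, whose real coefficients
    depend on the input at each step.
    """
    h, out = 0.0, []
    for x in xs:
        h = decay * h + gain * x
        out.append(h)
    return out


def bi_scan(xs: List[float], decay: float, gain: float) -> List[float]:
    """Sum a forward scan and a backward scan over the same sequence."""
    forward = scan(xs, decay, gain)
    backward = scan(xs[::-1], decay, gain)[::-1]
    return [f + b for f, b in zip(forward, backward)]


def recurrent_apply(xs: List[float], params: Dict[str, float], steps: int) -> List[float]:
    """Apply the same block `steps` times, reusing one parameter set."""
    for _ in range(steps):
        xs = bi_scan(xs, params["decay"], params["gain"])
    return xs


params = {"decay": 0.5, "gain": 1.0}   # hypothetical shared parameters
impulse = [0.0, 0.0, 1.0, 0.0, 0.0]    # a single event in the middle
once = recurrent_apply(impulse, params, steps=1)
twice = recurrent_apply(impulse, params, steps=2)
print(once)
print(twice)
```

With one pass, the impulse spreads symmetrically to both sides because of the backward scan. A second pass through the same block spreads it further, and `params` still holds only two values.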

Paper Structure (25 sections, 7 equations, 4 figures, 2 tables)

Figures (4)

  • Figure 1: Illustration of the proposed architecture and its components. (a) Architecture overview, consisting of four main parts: Pretrained video encoder, Backbone, Neck, and Heads (Action classification head and Temporal boundary regression head). (b) Overview of the proposed methods, with the Stem module highlighted by an orange shaded area. The Stem module consists of three parts: dual-path processing (Dual Bi-S6 Structure), feature aggregation with Temporal/Channel Bi-S6 (Feature Aggregated Bi-S6 Block Design), and repeated processing with shared networks (Recurrent Mechanism).
  • Figure 2: Diagrams of the Embedding, Stem, and Branch modules. (a) Embedding module. (b) Stem module. (c) Branch module.
  • Figure 3: Diagrams of the Feature Aggregated Bi-S6 block design. (a) TFA-Bi-S6 model. (b) CFA-Bi-S6 model. (c) T-Bi-S6 model.
  • Figure 4: Process of extracting spatiotemporal features using the pretrained video encoder. The encoder processes 30fps RGB video frames, groups them into 16-frame clips, and applies patchification, positional embedding, and multi-head self-attention to produce encoded feature vectors.
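
The clip grouping described in the Figure 4 caption can be sketched as follows. The 30 fps frame rate and the 16-frame clip length come from the caption. The stride between clips is not stated there, so the `stride` parameter and both function names (`clip_windows`, `clip_center_seconds`) are assumptions made for illustration only.

```python
from typing import List, Tuple


def clip_windows(num_frames: int, clip_len: int = 16, stride: int = 4) -> List[Tuple[int, int]]:
    """Return half-open [start, end) frame ranges, one per full clip.

    clip_len=16 follows the caption; stride is an assumed value.
    Trailing frames that do not fill a whole clip are dropped.
    """
    if clip_len <= 0 or stride <= 0:
        raise ValueError("clip_len and stride must be positive")
    return [(s, s + clip_len) for s in range(0, num_frames - clip_len + 1, stride)]


def clip_center_seconds(window: Tuple[int, int], fps: float = 30.0) -> float:
    """Map a clip's frame range to the timestamp of its center, in seconds."""
    start, end = window
    return (start + end) / 2.0 / fps


# Three seconds of 30 fps video, split into non-overlapping 16-frame clips.
windows = clip_windows(num_frames=90, clip_len=16, stride=16)
centers = [clip_center_seconds(w) for w in windows]
print(windows)
print(centers)
```

Each window would then be passed to the pretrained encoder to produce one feature vector. Mapping a window back to a timestamp, as `clip_center_seconds` does, is what lets predicted boundaries be expressed in seconds.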