Simba: Mamba augmented U-ShiftGCN for Skeletal Action Recognition in Videos

Soumyabrata Chaudhuri, Saumik Bhattacharya

TL;DR

This work introduces Simba, the first skeleton action recognition framework to integrate the selective state-space model Mamba for efficient long-sequence temporal modeling on graph-structured skeleton data. The architecture pairs a Down-sampling ShiftGCN Encoder, an Intermediate Mamba (I-Mamba) Block, an Up-sampling ShiftGCN Decoder, and a ShiftTCN, forming a U-ShiftGCN-based encoder–decoder with four-stream fusion across modalities. Empirical results on NTU RGB+D 60, NTU RGB+D 120, and Northwestern-UCLA show state-of-the-art performance, with ablations confirming the critical role of the I-Mamba block and the benefits of the proposed down-/up-sampling strategy. The work advances SAR by combining Mamba's efficient long-range modeling with graph priors, offering a scalable approach to temporal graph data in video action recognition.
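
Multi-stream fusion in skeleton action recognition is commonly realized as a weighted sum of per-stream class scores. The sketch below illustrates that general idea in plain Python. The stream names (joint, bone, joint motion, bone motion), the equal default weights, the example logits, and the function names are all assumptions made for illustration; this summary does not confirm how Simba weights or names its four streams.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_streams(stream_logits, weights=None):
    """Weighted sum of per-stream softmax scores.

    stream_logits: dict mapping a stream name to its list of class logits.
    weights: optional dict of per-stream weights (assumed equal if omitted).
    Returns (predicted class index, fused score list).
    """
    names = list(stream_logits)
    if weights is None:
        weights = {name: 1.0 for name in names}
    num_classes = len(stream_logits[names[0]])
    fused = [0.0] * num_classes
    for name in names:
        probs = softmax(stream_logits[name])
        for c in range(num_classes):
            fused[c] += weights[name] * probs[c]
    predicted = max(range(num_classes), key=fused.__getitem__)
    return predicted, fused


# Hypothetical logits for one clip and three action classes.
example = {
    "joint": [2.0, 1.0, 0.1],
    "bone": [1.5, 1.8, 0.2],
    "joint_motion": [0.3, 2.2, 0.1],
    "bone_motion": [0.2, 1.9, 0.4],
}
predicted_class, fused_scores = fuse_streams(example)
```

In this made-up example the joint stream alone favors class 0, but the other three streams outvote it, which is the usual motivation for fusing complementary modalities.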

Abstract

Skeleton Action Recognition (SAR) involves identifying human actions from skeletal joint coordinates and their interconnections. Plain Transformers have been applied to this task, but because they lack structural priors they still fall short of the current leading methods, which are built on Graph Convolutional Networks (GCNs). Recently, a selective state space model, Mamba, has emerged as a compelling alternative to the attention mechanism in Transformers, offering efficient modeling of long sequences. In this work, to the best of our knowledge, we present the first SAR framework incorporating Mamba. Each fundamental block of our model adopts a novel U-ShiftGCN architecture with Mamba as its core component. The encoder segment of the U-ShiftGCN extracts spatial features from the skeletal data using downsampling vanilla Shift S-GCN blocks. These spatial features then undergo intermediate temporal modeling by the Mamba block before passing to the decoder section, which comprises upsampling vanilla Shift S-GCN blocks. Additionally, a Shift T-GCN (ShiftTCN) temporal modeling unit is applied before the exit of each fundamental block to refine temporal representations. This integration of downsampling spatial, intermediate temporal, upsampling spatial, and final temporal subunits yields promising results for skeleton action recognition. We dub the resulting model Simba, which attains state-of-the-art performance on three well-known benchmark skeleton action recognition datasets: NTU RGB+D, NTU RGB+D 120, and Northwestern-UCLA. Interestingly, U-ShiftGCN (Simba without the Intermediate Mamba block) performs reasonably well by itself and surpasses our baseline.

Paper Structure

This paper contains 19 sections, 10 equations, 1 figure, 7 tables, and 1 algorithm.

Figures (1)

  • Figure 1: (a) The constituent module of our proposed model, Simba. It comprises four stages: a down-sampling Shift S-GCN encoder, an Intermediate Mamba block, an up-sampling Shift S-GCN decoder, and a final Shift T-GCN (ShiftTCN) that enhances the temporal representation. We stack this module serially to obtain Simba. Each component is explained in detail in Sec. \ref{sec: method}. The dimensions of each block's output tensor are written above the block. (b) The Intermediate Mamba (I-Mamba) block. Here the SSM is primarily responsible for efficiently modeling long sequences, such as pose snapshots of a video over a given window size. This block lies at the heart of our architecture, and its functionality is elaborated in subsection \ref{subsec:mamba}.
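
To make the role of the SSM in panel (b) concrete, here is a minimal sketch of a discretized linear state-space recurrence scanned over a sequence, written in plain Python. It is a simplified stand-in, not the I-Mamba block: the parameters `a`, `b`, `c`, and `delta` are fixed here, whereas Mamba makes `b`, `c`, and `delta` depend on the input (the "selective" part) and computes the scan with a hardware-aware parallel algorithm. The function name and the single-channel input are assumptions for illustration.

```python
import math


def ssm_scan(xs, a, b, c, delta):
    """Scan a diagonal state-space model over a 1-D sequence.

    Recurrence (per state dimension n):
        h_t[n] = exp(delta * a[n]) * h_{t-1}[n] + delta * b[n] * x_t
        y_t    = sum_n c[n] * h_t[n]
    The state h has fixed size, so the cost grows linearly with len(xs).
    """
    state_size = len(a)
    a_bar = [math.exp(delta * a_n) for a_n in a]  # discretized state decay
    b_bar = [delta * b_n for b_n in b]            # simplified input scaling
    h = [0.0] * state_size
    ys = []
    for x in xs:
        h = [a_bar[n] * h[n] + b_bar[n] * x for n in range(state_size)]
        ys.append(sum(c[n] * h[n] for n in range(state_size)))
    return ys


# With a = 0 the state never decays, so the output is a running sum.
running_sum = ssm_scan([1.0, 2.0, 3.0], a=[0.0], b=[1.0], c=[1.0], delta=1.0)

# With a = -ln(2) the state halves each step: an impulse fades geometrically.
impulse_response = ssm_scan([1.0, 0.0, 0.0], a=[-math.log(2.0)], b=[1.0], c=[1.0], delta=1.0)
```

In a skeleton setting, `xs` would correspond to one feature channel observed across the frames of a window. The decay term controls how long earlier poses influence later outputs.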