MSF-Mamba: Motion-aware State Fusion Mamba for Efficient Micro-Gesture Recognition

Deng Li, Jun Shao, Bohao Xing, Rong Gao, Bihan Wen, Heikki Kälviäinen, Xin Liu

TL;DR

This work tackles micro-gesture recognition by enhancing an efficient state-space model (Mamba) with motion-aware local spatiotemporal modeling. It introduces a central frame difference–based Multiscale CFD State Fusion Module (MCFM) and an Adaptive Scale Weighting Module (ASWM), yielding MSF-Mamba, with MSF-Mamba+ adding extra multiscale branches and dynamic fusion. By combining bidirectional SSM for global context with CFD-driven local cues, the approach achieves state-of-the-art results on iMiGUE and SMG while maintaining linear-time inference $O(n)$. The findings demonstrate that injecting motion-aware local information into a lightweight, scalable sequence model provides substantial accuracy gains with favorable efficiency, enabling practical, real-time MGR deployment.
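
To make the linear-time claim concrete, the following is a minimal sketch of a bidirectional state-space scan, not the authors' implementation. It assumes fixed per-channel (diagonal) parameters `a`, `b`, `c`, all of which are hypothetical names; real Mamba uses input-dependent (selective) parameters, which this simplification omits. Each direction visits every token exactly once, so the cost grows linearly with the sequence length $n$.

```python
import numpy as np

def ssm_scan(x, a, b, c):
    """Run a diagonal linear recurrence over tokens x of shape (n, d).

    a, b, c are hypothetical per-channel parameters of shape (d,).
    The state at step t depends only on the state at step t-1.
    """
    h = np.zeros(x.shape[1])
    y = np.empty_like(x, dtype=float)
    for t in range(x.shape[0]):   # one pass over the sequence: O(n)
        h = a * h + b * x[t]      # state update from the previous state
        y[t] = c * h              # read-out
    return y

def bidirectional_ssm(x, a, b, c):
    """Sum a forward scan and a backward scan (an assumed fusion rule)."""
    fwd = ssm_scan(x, a, b, c)
    bwd = ssm_scan(x[::-1], a, b, c)[::-1]  # scan reversed, then restore order
    return fwd + bwd
```

Summing the two directions is only one possible way to combine them; the source does not state which fusion rule is used.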

Abstract

Micro-gesture recognition (MGR) targets the identification of subtle and fine-grained human motions and requires accurate modeling of both long-range and local spatiotemporal dependencies. While CNNs are effective at capturing local patterns, they struggle with long-range dependencies due to their limited receptive fields. Transformer-based models address this limitation through self-attention mechanisms but suffer from high computational costs. Recently, Mamba has shown promise as an efficient model, leveraging state space models (SSMs) to enable linear-time processing. However, directly applying the vanilla Mamba to MGR may not be optimal. This is because Mamba processes inputs as 1D sequences, with state updates relying solely on the previous state, and thus lacks the ability to model local spatiotemporal dependencies. In addition, previous methods lack a motion-aware design, which is crucial in MGR. To overcome these limitations, we propose motion-aware state fusion Mamba (MSF-Mamba), which enhances Mamba with local spatiotemporal modeling by fusing local contextual neighboring states. Our design introduces a motion-aware state fusion module based on central frame difference (CFD). Furthermore, a multiscale version named MSF-Mamba+ is proposed. Specifically, MSF-Mamba+ supports multiscale motion-aware state fusion, as well as an adaptive scale weighting module that dynamically weighs the fused states across different scales. These enhancements explicitly address the limitations of vanilla Mamba by enabling motion-aware local spatiotemporal modeling, allowing MSF-Mamba and MSF-Mamba+ to effectively capture subtle motion cues for MGR. Experiments on two public MGR datasets demonstrate that even the lightweight version, namely, MSF-Mamba, achieves SoTA performance, outperforming existing CNN-, Transformer-, and SSM-based models while maintaining high efficiency.
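
The idea of fusing neighboring states with a central frame difference can be illustrated with a small sketch. The abstract does not give the exact formulation, so everything here is an assumption: the function name `cfd_state_fusion`, the weight `theta`, the edge padding, and the rule itself (average the differences between states in temporally neighboring frames and the central frame over a local window, then add that motion term to the original state).

```python
import numpy as np

def cfd_state_fusion(h, win=3, theta=1.0):
    """Hypothetical CFD-style fusion over hidden states h of shape (T, H, W, d).

    For every position, states from neighbouring frames inside a
    win x win x win window are compared with the central frame at the
    same spatial offset; the mean difference is added back to h.
    """
    T, H, W, _ = h.shape
    r = win // 2
    # Replicate borders so every position has a full window (an assumption).
    hp = np.pad(h, ((r, r), (r, r), (r, r), (0, 0)), mode="edge")
    motion = np.zeros_like(h, dtype=float)
    count = 0
    for dt in range(-r, r + 1):
        if dt == 0:
            continue  # the central frame contributes no temporal difference
        for dy in range(-r, r + 1):
            for dx in range(-r, r + 1):
                ys = slice(r + dy, r + dy + H)
                xs = slice(r + dx, r + dx + W)
                neighbour = hp[r + dt : r + dt + T, ys, xs]
                centre = hp[r : r + T, ys, xs]
                motion += neighbour - centre
                count += 1
    return h + theta * motion / count
```

Under these assumptions the motion term vanishes for a temporally static input, so the fused states equal the original states, while any change between frames shifts the fused states toward the neighboring frames.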

Paper Structure

This paper contains 18 sections, 15 equations, 8 figures, 7 tables, 1 algorithm.

Figures (8)

  • Figure 1: Comparison between the previous Mamba-based visual models and the proposed MSF-Mamba in MGR. MSF-Mamba (c) enhances the original Mamba architecture by introducing multiscale motion-aware state fusion to aggregate local contextual states, and an adaptive scale weighting mechanism that dynamically weighs fused states at different scales. The red arrow denotes the interaction of state fusion.
  • Figure 2: Overview of the proposed MSF-Mamba framework for micro-gesture recognition (MGR). Given an input video sequence $V_i \in \mathbb{R}^{T \times H \times W \times 3}$, patch embedding generates a token sequence $Z \in \mathbb{R}^{n \times d}$. A bidirectional SSM module then generates hidden states $H \in \mathbb{R}^{n \times d}$. The multiscale central frame difference state fusion module (MCFM) applies central temporal difference (CTD)-based state fusion over multiple window sizes to produce $H_{\text{MCFM}} \in \mathbb{R}^{M \times d \times T \times H' \times W'}$. The adaptive scale weighting module (ASWM) then aggregates these multiscale fused states using attention weights $\alpha \in \mathbb{R}^{3 \times T \times H' \times W'}$. The final feature map is linearly projected for prediction.
  • Figure 3: Qualitative visualization of activation maps from the output of the multiscale central frame difference state fusion module (MCFM) across various micro-gesture categories. Each subfigure group shows the raw input frame (left), the activation map of MCFM without central frame difference (CFD) (middle), and the activation map of MCFM with CFD (right).
  • Figure 4: Visualization of learned attention weights of ASWM over different fusion scales ($win@3\times3\times3$, $win@5\times5\times5$, $win@7\times7\times7$) for different micro-gestures.
  • Figure 5: Comparison between MSF-Mamba, MSF-Mamba$^{+}$, and VideoMamba at different model scales on the SMG dataset. The x-axis indicates average inference time per video (in seconds), and the y-axis shows Top-1 accuracy.
  • ...and 3 more figures
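
The tensor shapes in the Figure 2 caption suggest how the adaptive scale weighting could work: per-position weights over the $M$ scales are normalized and used to mix the multiscale fused states. The sketch below is one way to realize that, under stated assumptions. The scoring vector `score_w` and the use of a softmax across scales are hypothetical; the caption only specifies the shapes of $H_{\text{MCFM}}$ and $\alpha$.

```python
import numpy as np

def adaptive_scale_weighting(h_mcfm, score_w):
    """Mix multiscale states h_mcfm of shape (M, d, T, H, W).

    score_w is a hypothetical scoring vector of shape (d,) that turns
    each state into one logit per scale and position.
    Returns the fused states (d, T, H, W) and the weights (M, T, H, W).
    """
    logits = np.einsum("mdthw,d->mthw", h_mcfm, score_w)
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    alpha = np.exp(logits)
    alpha = alpha / alpha.sum(axis=0, keepdims=True)     # softmax over scales
    fused = (alpha[:, None] * h_mcfm).sum(axis=0)        # weighted sum of scales
    return fused, alpha

# Example with M = 3 scales (e.g. windows of 3, 5 and 7), d = 4 channels.
rng = np.random.default_rng(0)
h_mcfm = rng.normal(size=(3, 4, 2, 5, 5))
fused, alpha = adaptive_scale_weighting(h_mcfm, rng.normal(size=4))
```

Because the weights sum to one over the scale axis at every position, each output state is a convex combination of the states produced at the different window sizes.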