MSF-Mamba: Motion-aware State Fusion Mamba for Efficient Micro-Gesture Recognition
Deng Li, Jun Shao, Bohao Xing, Rong Gao, Bihan Wen, Heikki Kälviäinen, Xin Liu
TL;DR
This work tackles micro-gesture recognition (MGR) by enhancing an efficient state-space model (Mamba) with motion-aware local spatiotemporal modeling. It introduces a central frame difference (CFD)–based Multiscale CFD State Fusion Module (MCFM) and an Adaptive Scale Weighting Module (ASWM), yielding MSF-Mamba, with MSF-Mamba+ adding extra multiscale branches and dynamic fusion. By combining a bidirectional SSM for global context with CFD-driven local cues, the approach achieves state-of-the-art results on iMiGUE and SMG while maintaining linear-time, $O(n)$, inference in the sequence length. The findings demonstrate that injecting motion-aware local information into a lightweight, scalable sequence model provides substantial accuracy gains with favorable efficiency, enabling practical, real-time MGR deployment.
Abstract
Micro-gesture recognition (MGR) targets the identification of subtle and fine-grained human motions and requires accurate modeling of both long-range and local spatiotemporal dependencies. While CNNs are effective at capturing local patterns, they struggle with long-range dependencies due to their limited receptive fields. Transformer-based models address this limitation through self-attention mechanisms but suffer from high computational costs. Recently, Mamba has shown promise as an efficient model, leveraging state space models (SSMs) to enable linear-time processing. However, directly applying the vanilla Mamba to MGR may not be optimal. This is because Mamba processes inputs as 1D sequences, with state updates relying solely on the previous state, and thus lacks the ability to model local spatiotemporal dependencies. In addition, previous methods lack an explicit motion-aware design, which is crucial in MGR. To overcome these limitations, we propose Motion-aware State Fusion Mamba (MSF-Mamba), which enhances Mamba with local spatiotemporal modeling by fusing local contextual neighboring states. Our design introduces a motion-aware state fusion module based on central frame difference (CFD). Furthermore, a multiscale version named MSF-Mamba+ is proposed. Specifically, MSF-Mamba+ supports multiscale motion-aware state fusion, as well as an adaptive scale weighting module that dynamically weighs the fused states across different scales. These enhancements explicitly address the limitations of vanilla Mamba by enabling motion-aware local spatiotemporal modeling, allowing MSF-Mamba and MSF-Mamba+ to effectively capture subtle motion cues for MGR. Experiments on two public MGR datasets demonstrate that even the lightweight version, namely MSF-Mamba, achieves state-of-the-art (SoTA) performance, outperforming existing CNN-, Transformer-, and SSM-based models while maintaining high efficiency.
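
To make the idea of CFD-based state fusion concrete, the following is a minimal, hypothetical sketch and not the authors' implementation. It assumes that "central frame difference" means accumulating the differences between a central state and its temporal neighbors, that each scale corresponds to a temporal radius, and that scale weights come from a softmax over fixed logits. The function names (`central_frame_difference`, `fuse_states`) and the parameters `radii` and `scale_logits` are illustrative assumptions; the abstract describes the weighting as dynamic and the neighborhood as spatiotemporal, so both would be richer in the actual model.

```python
import math


def central_frame_difference(states, t, radius):
    """Sum of (neighbor - center) over temporal neighbors within `radius` of step t.

    Assumption: only the temporal axis is considered; neighbors that fall
    outside the sequence are skipped.
    """
    center = states[t]
    diff = [0.0] * len(center)
    for dt in range(-radius, radius + 1):
        j = t + dt
        if dt == 0 or not (0 <= j < len(states)):
            continue
        for k in range(len(center)):
            diff[k] += states[j][k] - center[k]
    return diff


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_states(states, radii=(1, 2), scale_logits=(0.0, 0.0)):
    """Add a weighted, multiscale CFD term to every state in the sequence.

    Assumption: one logit per scale, fixed here for illustration; a learned,
    input-dependent weighting would replace `scale_logits`.
    """
    weights = softmax(list(scale_logits))
    fused = []
    for t, state in enumerate(states):
        out = list(state)
        for w, r in zip(weights, radii):
            d = central_frame_difference(states, t, r)
            out = [o + w * dk for o, dk in zip(out, d)]
        fused.append(out)
    return fused
```

Under these assumptions, a sequence with no motion (identical states at every step) passes through unchanged, because every difference term is zero. Only frames whose neighbors differ from them receive a correction, which is the sense in which the fusion is motion-aware.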
