AMMSM: Adaptive Motion Magnification and Sparse Mamba for Micro-Expression Recognition

Xuxiong Liu, Tengteng Dong, Fei Wang, Weijie Feng, Xiao Sun

TL;DR

AMMSM addresses the challenge of recognizing subtle micro-expressions by integrating adaptive, self-supervised motion magnification with a Sparse Mamba backbone that selects motion-critical regions. The framework jointly optimizes magnification and sparsity through evolutionary search and end-to-end training, achieving state-of-the-art performance on CASME II and SAMM with strong robustness. Key contributions include the Adaptive Motion Magnification module, the Sparse State Space Duality block, and the adaptive configuration search, all validated via extensive ablations and LOSO benchmarks. This approach offers a scalable, efficient path for high-precision MER in real-world, resource-constrained settings.

Abstract

Micro-expressions are typically regarded as unconscious manifestations of a person's genuine emotions. However, their short duration and subtle signals pose significant challenges for downstream recognition. We propose a multi-task learning framework named the Adaptive Motion Magnification and Sparse Mamba (AMMSM) to address this. This framework aims to enhance the accurate capture of micro-expressions through self-supervised subtle motion magnification, while the sparse spatial selection Mamba architecture combines sparse activation with the advanced Visual Mamba model to model key motion regions and their valuable representations more effectively. Additionally, we employ evolutionary search to optimize the magnification factor and the sparsity ratios of spatial selection, followed by fine-tuning to improve performance further. Extensive experiments on two standard datasets demonstrate that the proposed AMMSM achieves state-of-the-art (SOTA) accuracy and robustness.
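The evolutionary search over the magnification factor and the sparsity ratios can be pictured with a small sketch. This is only an illustration under stated assumptions, not the authors' procedure: the value ranges, population size, mutation scales, the number of sparsity ratios (`n_ratios`), and the `toy_fitness` function are all hypothetical. In practice the fitness would presumably be validation accuracy of the trained model under a given configuration.

```python
import random

# Hypothetical search ranges; the abstract does not specify them.
ALPHA_RANGE = (1.0, 10.0)   # magnification factor
RATIO_RANGE = (0.1, 1.0)    # fraction of spatial tokens kept


def clip(value, bounds):
    """Clamp value into the closed interval given by bounds."""
    low, high = bounds
    return min(high, max(low, value))


def evolve(fitness, n_ratios=4, pop_size=16, generations=20, seed=0):
    """Generic elitist evolutionary search over (alpha, sparsity ratios)."""
    rng = random.Random(seed)

    def random_candidate():
        alpha = rng.uniform(*ALPHA_RANGE)
        ratios = tuple(rng.uniform(*RATIO_RANGE) for _ in range(n_ratios))
        return alpha, ratios

    def crossover(a, b):
        # Pick each gene from one of the two parents.
        alpha = rng.choice((a[0], b[0]))
        ratios = tuple(rng.choice(pair) for pair in zip(a[1], b[1]))
        return alpha, ratios

    def mutate(candidate):
        # Add small Gaussian noise and clamp back into the search ranges.
        alpha, ratios = candidate
        alpha = clip(alpha + rng.gauss(0.0, 0.5), ALPHA_RANGE)
        ratios = tuple(clip(r + rng.gauss(0.0, 0.05), RATIO_RANGE) for r in ratios)
        return alpha, ratios

    population = [random_candidate() for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the top quarter unchanged (elitism), refill with offspring.
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 4]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b)))
        population = parents + children
    return max(population, key=fitness)


def toy_fitness(candidate):
    """Stand-in for validation accuracy: peaks at an arbitrary target."""
    alpha, ratios = candidate
    return -((alpha - 4.0) ** 2) - sum((r - 0.5) ** 2 for r in ratios)


best_alpha, best_ratios = evolve(toy_fitness)
```

Because the best candidates survive each generation unchanged, the best fitness never decreases; the selected configuration would then be fine-tuned, as the abstract describes.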

Paper Structure

This paper contains 17 sections, 8 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Visualization of magnified optical flow. The colors of the motion maps represent the directions and intensity of the motion. The expressions from top to bottom are happiness, surprise, and disgust. The motion magnification module magnifies movement in regions such as the eyebrows, cheeks, and corners of the mouth while suppressing irrelevant motions for MER.
  • Figure 2: Overall architecture of the AMMSM model. AMMSM consists of a motion magnification module and a classification module. A standard UNet [du2025improving] is used as the motion magnifier within the motion magnification module. The classification module follows a two-stream network architecture. In the spatial stream, a ResNet18 [he2016deep] extracts spatial features from $I_{onset}$. The temporal stream takes $OF_{mag}$ as input, with the backbone structured into four hierarchical stages, and the last layer of the last two blocks is replaced with MSA. The spatiotemporal feature fusion is performed at the end of stage 2.
  • Figure 3: Architecture of SSSD block. The SSSD block comprises a sparse activation module, an SSD block, and an FFN. The sparse activation module enables the SSD block to focus exclusively on the most critical regions of motion, thereby preventing the introduction of irrelevant information. Meanwhile, DWConv and FFN are used separately to enhance the model's ability to capture local information and promote cross-channel information exchange.
  • Figure 4: Distribution of sparsity ratios and magnification factor. The blue dots represent the distribution of sparsity ratios, while the orange dot represents the distribution of the magnification factor.
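The sparse activation idea in the Figure 3 caption (only the most critical motion regions reach the SSD block) can be sketched as a top-k token selection. This is a minimal sketch under assumptions, not the SSSD block itself: the per-token `scores` (for example, a motion magnitude per patch), the `mixer` callable standing in for the SSD block, and the choice to pass unselected tokens through unchanged are all hypothetical.

```python
def select_top_tokens(scores, keep_ratio):
    """Return the indices of the highest-scoring tokens, in original order."""
    k = max(1, round(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def sparse_apply(tokens, scores, keep_ratio, mixer):
    """Apply `mixer` only to the selected tokens and scatter the results back.

    Tokens that are not selected are copied through unchanged (an assumption).
    """
    kept = select_top_tokens(scores, keep_ratio)
    mixed = mixer([tokens[i] for i in kept])
    output = list(tokens)
    for index, value in zip(kept, mixed):
        output[index] = value
    return output


# Toy usage: four scalar "tokens", keep the half with the largest scores.
tokens = [1.0, 2.0, 3.0, 4.0]
scores = [0.1, 0.9, 0.2, 0.8]
result = sparse_apply(tokens, scores, 0.5, lambda xs: [x * 10 for x in xs])
```

Here `keep_ratio` plays the role of a sparsity ratio: lowering it sends fewer tokens through the expensive mixer, which is the efficiency lever the configuration search tunes alongside the magnification factor.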