Motion Matters: Motion-guided Modulation Network for Skeleton-based Micro-Action Recognition

Jihao Gu, Kun Li, Fei Wang, Yanyan Wei, Zhiliang Wu, Hehe Fan, Meng Wang

TL;DR

This paper tackles the challenge of recognizing micro-actions from skeletal data by explicitly modeling subtle motion cues. It introduces the Motion-guided Modulation Network (MMN), which decomposes motion into skeletal-level modulation (MSM) and temporal-level modulation (MTM), and couples them with a motion consistency learning pipeline to fuse multi-scale features. The approach leverages skeleton-aware embeddings, skeleton-temporal positional encoding, and skeletal-temporal formers to produce discriminative spatiotemporal representations. Experiments on MA-52 and iMiGUE demonstrate state-of-the-art performance with strong efficiency, validating the importance of explicit, motion-guided modulation for micro-action recognition.

Abstract

Micro-Actions (MAs) are an important form of non-verbal communication in social interactions, with potential applications in human emotional analysis. However, existing methods in Micro-Action Recognition often overlook the inherent subtle changes in MAs, which limits the accuracy of distinguishing MAs with subtle changes. To address this issue, we present a novel Motion-guided Modulation Network (MMN) that implicitly captures and modulates subtle motion cues to enhance spatial-temporal representation learning. Specifically, we introduce a Motion-guided Skeletal Modulation module (MSM) to inject motion cues at the skeletal level, acting as a control signal to guide spatial representation modeling. In parallel, we design a Motion-guided Temporal Modulation module (MTM) to incorporate motion information at the frame level, facilitating the modeling of holistic motion patterns in micro-actions. Finally, we propose a motion consistency learning strategy to aggregate the motion cues from multi-scale features for micro-action classification. Experimental results on the Micro-Action 52 and iMiGUE datasets demonstrate that MMN achieves state-of-the-art performance in skeleton-based micro-action recognition, underscoring the importance of explicitly modeling subtle motion cues. The code will be available at https://github.com/momiji-bit/MMN.

Paper Structure

This paper contains 16 sections, 18 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: (a) Micro-Action Recognition (MAR) aims to recognize micro-actions, which have subtle motion amplitudes, high inter-class similarity, and notable intra-class variability caused by individual differences. (b) The pipeline of the proposed Motion-guided Modulation Network (MMN). Motion information is dynamically modulated into the skeletal and temporal dimensions separately, facilitating spatial-temporal skeleton feature representation.
  • Figure 2: Overview of the proposed Motion-guided Modulation Network (MMN). It mainly consists of three key modules: Feature Embedding (§\ref{sec:embedding}), Motion-guided Feature Modulation (§\ref{sec:mfb}), and Motion Consistency Learning (§\ref{sec:mcl}). We first project the skeleton data into a high-dimensional feature space ${\bm{X}}_{proj}$. Then, the motion-guided feature modulation module is designed to inject motion patterns into the spatial and temporal dimensions separately, guiding the model to mine crucial cues of micro-actions. Finally, the motion consistency learning module aggregates multi-scale motion cues for action classification.
  • Figure 3: Qualitative results on the Micro-Action 52 dataset \cite{guo2024benchmarking}. LEFT: Motion-modulated feature ${\bm{X}}_{agg}$ from $N$ stacked Motion-guided Skeletal-Temporal Formers (§\ref{sec:mfb}). For the micro-action "turning head", the model gradually focuses on discriminative joints, i.e., the facial joints highlighted by the red dashed box. RIGHT: Case study on different micro-actions.
  • Figure 4: Confusion matrix on the test set of the Micro-Action 52 dataset. Zoom in for details.
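
The idea of injecting motion cues at the skeletal level and at the frame level, as described in the abstract and the Figure 2 caption, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that motion is approximated by frame-to-frame joint displacement, that modulation is a simple sigmoid gate, and that the gates are applied to raw 2D coordinates rather than to the learned embedding ${\bm{X}}_{proj}$. All function names are hypothetical, and the embedding, former, and motion consistency learning stages are omitted.

```python
import math


def frame_differences(seq):
    """Motion cue (assumed form): per-joint displacement between frames.

    seq is indexed [T][J][2] (frames, joints, x/y coordinates). The first
    frame gets zero motion so the output has the same shape as the input.
    """
    zeros = [[0.0] * len(joint) for joint in seq[0]]
    diffs = [
        [[c1 - c0 for c0, c1 in zip(j0, j1)] for j0, j1 in zip(f0, f1)]
        for f0, f1 in zip(seq, seq[1:])
    ]
    return [zeros] + diffs


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def joint_gates(motion):
    """Skeletal-level gate (assumption): one weight per joint per frame,
    derived from the magnitude of that joint's displacement."""
    return [[sigmoid(math.hypot(*joint)) for joint in frame] for frame in motion]


def frame_gates(motion):
    """Temporal-level gate (assumption): one weight per frame, derived
    from the mean displacement magnitude over all joints."""
    return [
        sigmoid(sum(math.hypot(*joint) for joint in frame) / len(frame))
        for frame in motion
    ]


def modulate(seq):
    """Scale every coordinate by its joint gate and its frame gate."""
    motion = frame_differences(seq)
    jg, fg = joint_gates(motion), frame_gates(motion)
    return [
        [[c * jg[t][j] * fg[t] for c in joint] for j, joint in enumerate(frame)]
        for t, frame in enumerate(seq)
    ]


# Two frames, two joints: joint 0 moves by (3, 4), joint 1 stays still.
seq = [[[0.0, 0.0], [1.0, 1.0]], [[3.0, 4.0], [1.0, 1.0]]]
motion = frame_differences(seq)
modulated = modulate(seq)
```

In this toy sequence the moving joint receives a larger skeletal-level gate than the static one, and the second frame receives a larger temporal-level gate than the first, which is the qualitative behaviour that the two modulation levels are meant to convey.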