B-MoE: A Body-Part-Aware Mixture-of-Experts "All Parts Matter" Approach to Micro-Action Recognition

Nishit Poddar, Aglind Reka, Diana-Laura Borza, Snehashis Majhi, Michal Balazia, Abhijit Das, Francois Bremond

Abstract

Micro-actions are fleeting, low-amplitude motions such as glances, nods, or minor posture shifts. They carry rich social meaning but remain difficult for current action recognition models to recognize because of their subtlety, short duration, and high inter-class ambiguity. In this paper, we introduce B-MoE, a Body-part-aware Mixture-of-Experts framework designed to explicitly model the structured nature of human motion. In B-MoE, each expert specializes in a distinct body region (head, body, upper limbs, lower limbs) and is based on the lightweight Macro-Micro Motion Encoder (M3E), which captures both long-range contextual structure and fine-grained local motion. A cross-attention routing mechanism learns inter-region relationships and dynamically selects the most informative regions for each micro-action. B-MoE uses a dual-stream encoder that fuses these region-specific semantic cues with global motion features to jointly capture the spatially localized cues and temporally subtle variations that characterize micro-actions. Experiments on three challenging benchmarks (MA-52, SocialGesture, and MPII-GroupInteraction) show consistent state-of-the-art gains, with improvements in ambiguous, underrepresented, and low-amplitude classes.
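The routing idea in the abstract can be illustrated with a minimal sketch. It assumes a single global-motion query vector attending over one output vector per body-region expert using scaled dot-product attention, with the expert outputs serving as both keys and values. The function names (`softmax`, `route`), the toy vectors, and the absence of learned projections are all illustrative assumptions; this is not the authors' implementation.

```python
import math
from typing import Dict, List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(
    query: List[float], experts: Dict[str, List[float]]
) -> Tuple[Dict[str, float], List[float]]:
    """Weight expert outputs by scaled dot-product similarity to the query.

    Hypothetical simplification: expert outputs act as keys and values,
    and no learned projection matrices are applied.
    """
    names = list(experts)
    scale = math.sqrt(len(query))
    scores = [
        sum(q * k for q, k in zip(query, experts[name])) / scale
        for name in names
    ]
    weights = softmax(scores)
    # Fused feature: attention-weighted sum of the expert outputs.
    fused = [
        sum(w * experts[name][i] for w, name in zip(weights, names))
        for i in range(len(query))
    ]
    return dict(zip(names, weights)), fused


# Toy expert outputs for the four body regions named in the abstract.
experts = {
    "head": [2.0, 0.0],
    "body": [0.0, 0.0],
    "upper_limbs": [0.0, 1.0],
    "lower_limbs": [-1.0, 0.0],
}
query = [1.0, 0.0]  # stand-in for a global motion feature
weights, fused = route(query, experts)
```

In this toy setup the head expert receives the largest weight because its output is most aligned with the query. A hard top-k selection could replace the soft weighting, but the abstract does not say which variant B-MoE uses.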

Paper Structure (20 sections, 4 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Visualization of MA categories illustrating fine-grained motion variations and the challenges of class imbalance and inter-category ambiguity. Example actions (E1: Scratching or touching chest, C11: Pointing oneself) appear visually similar but belong to different semantic categories [guo2024benchmarking].
  • Figure 2: Semantic Branch: Using SAPIENS [khirodkar2024sapiens], we segment each frame, crop around the target body part (the upper limbs in this example), and apply the corresponding mask to the cropped region. The resulting cropped and masked video is then processed by VideoMAE-V2, pretrained on Kinetics.
  • Figure 3: B-MoE: A dual-stream encoder extracts region-conditioned semantic features with a semantic encoder and global motion features with a global motion encoder. The semantic stream is routed through a region-aware MoE, where each expert specializes in modeling micro-movements within a specific body region. A cross-attention fusion head integrates expert outputs with motion saliency from the global stream, and a transformer-MLP classifier produces the final predictions.
  • Figure 4: Macro-Micro Motion Encoder (M3E). The input sequence is processed with multi-head self-attention to capture global temporal dependencies, followed by an SGP module [shi2023tridet] for fine-grained local motion reasoning. During pre-training, a semantic alignment loss ($\mathcal{L}_{\text{emb}}$) aligns learned features with word embeddings of action labels.
  • Figure 5: Cross-attention heatmap showing expert importance per micro-action class on MA-52. Brighter colors indicate higher expert activation.
  • ...and 1 more figure
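
The crop-and-mask step described in the Figure 2 caption can be sketched for a single frame as follows. This is a minimal illustration that assumes the segmentation model has already produced a binary mask for the target body part. The helper names (`bounding_box`, `crop_and_mask`), the grayscale list-of-lists frame, and the tight bounding-box crop are assumptions made for the example, not details taken from the paper.

```python
from typing import List, Tuple

Frame = List[List[int]]  # grayscale frame as rows of pixel intensities
Mask = List[List[int]]   # binary mask: 1 where the target body part is


def bounding_box(mask: Mask) -> Tuple[int, int, int, int]:
    """Return (top, left, bottom, right), inclusive, of the nonzero mask area."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:
        raise ValueError("mask is empty: body part not visible in this frame")
    return rows[0], cols[0], rows[-1], cols[-1]


def crop_and_mask(frame: Frame, mask: Mask) -> Frame:
    """Crop the frame to the mask's bounding box and zero out other pixels."""
    top, left, bottom, right = bounding_box(mask)
    return [
        [frame[r][c] if mask[r][c] else 0 for c in range(left, right + 1)]
        for r in range(top, bottom + 1)
    ]


frame = [
    [10, 11, 12, 13],
    [20, 21, 22, 23],
    [30, 31, 32, 33],
    [40, 41, 42, 43],
]
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
patch = crop_and_mask(frame, mask)
```

Applying the same function to every frame of a clip would yield the cropped and masked video that the caption says is passed to the video encoder. A real pipeline would likely also keep the crop size consistent across frames, which this sketch does not attempt.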