Benchmarking Micro-action Recognition: Dataset, Methods, and Applications
Dan Guo, Kun Li, Bin Hu, Yan Zhang, Meng Wang
TL;DR
Micro-action recognition tackles imperceptible body movements that reveal mental state. The paper introduces MA-52, a large whole-body micro-action dataset with 52 categories, seven body-part labels, 205 participants, and 22,422 videos collected through psychological interviews, and defines MANet, a ResNet-based benchmark augmented with SE and TSM and a joint embedding loss. It benchmarks nine general action recognition models on MA-52 and shows MANet achieves state-of-the-art performance across coarse and fine granularity, supported by ablations and visualization. The work extends to emotion analysis with MA-52-Pro, demonstrating that incorporating micro-action cues improves emotion recognition, and releases both data and code to foster future research.
Abstract
Micro-action is an imperceptible non-verbal behaviour characterised by low-intensity movement. It offers insights into the feelings and intentions of individuals and is important for human-oriented applications such as emotion recognition and psychological assessment. However, identifying, differentiating, and understanding micro-actions is challenging because these subtle human behaviours are imperceptible and hard to access in everyday life. In this study, we collect a new micro-action dataset, Micro-action-52 (MA-52), and propose a benchmark named micro-action network (MANet) for the micro-action recognition (MAR) task. Uniquely, MA-52 provides a whole-body perspective, including gestures and upper- and lower-limb movements, in an attempt to reveal comprehensive micro-action cues. In detail, MA-52 contains 52 micro-action categories along with seven body-part labels, and encompasses a full array of realistic and natural micro-actions, covering 205 participants and 22,422 video instances collected from psychological interviews. Based on the proposed dataset, we evaluate MANet and nine other prevalent action recognition methods. MANet incorporates squeeze-and-excitation (SE) and the temporal shift module (TSM) into the ResNet architecture to model the spatiotemporal characteristics of micro-actions. A joint-embedding loss is then designed for semantic matching between videos and action labels; it helps distinguish visually similar yet distinct micro-action categories. An extended application to emotion recognition demonstrates one important use of the proposed dataset and method. In future work, we will explore human behaviour, emotion, and psychological assessment in greater depth. The dataset and source code are released at https://github.com/VUT-HFUT/Micro-Action.
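To make the temporal shift idea concrete, the following is a minimal sketch of the generic TSM operation in plain Python. It is not taken from the released MANet code. It assumes a clip is a list of T frames, each reduced to a list of C channel values (spatial dimensions are omitted for brevity), and it uses the commonly cited setting of shifting one eighth of the channels in each direction. The function name `temporal_shift` and the parameter `fold_div` are illustrative choices.

```python
def temporal_shift(frames, fold_div=8):
    """Shift a fraction of channels along the time axis (illustrative sketch).

    frames: list of T frames, each a list of C channel values.
    fold_div: the first C // fold_div channels read from the next frame,
              the following C // fold_div channels read from the previous
              frame, and the remaining channels are left in place.
    Positions with no source frame are zero-filled.
    """
    num_frames = len(frames)
    if num_frames == 0:
        return []
    num_channels = len(frames[0])
    fold = num_channels // fold_div

    shifted = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:
                src = t + 1      # take this channel from the next frame
            elif c < 2 * fold:
                src = t - 1      # take this channel from the previous frame
            else:
                src = t          # unshifted channel
            if 0 <= src < num_frames:
                shifted[t][c] = frames[src][c]
    return shifted


# Toy clip: 3 frames, 8 channels, value = 10 * frame_index + channel_index.
clip = [[10 * t + c for c in range(8)] for t in range(3)]
mixed = temporal_shift(clip)
```

Because the shift only moves existing values between neighbouring frames, it mixes temporal information without adding parameters, which is one plausible reason it pairs well with a 2D ResNet backbone for low-intensity motion.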
