Benchmarking Micro-action Recognition: Dataset, Methods, and Applications

Dan Guo, Kun Li, Bin Hu, Yan Zhang, Meng Wang

TL;DR

Micro-action recognition tackles imperceptible body movements that reveal mental state. The paper introduces MA-52, a large whole-body micro-action dataset with 52 categories, seven body-part labels, 205 participants, and 22,422 videos collected through psychological interviews, and defines MANet, a ResNet-based benchmark augmented with SE and TSM and a joint embedding loss. It benchmarks nine general action recognition models on MA-52 and shows MANet achieves state-of-the-art performance across coarse and fine granularity, supported by ablations and visualization. The work extends to emotion analysis with MA-52-Pro, demonstrating that incorporating micro-action cues improves emotion recognition, and releases both data and code to foster future research.

Abstract

Micro-action is an imperceptible non-verbal behaviour characterised by low-intensity movement. It offers insights into the feelings and intentions of individuals and is important for human-oriented applications such as emotion recognition and psychological assessment. However, the identification, differentiation, and understanding of micro-actions pose challenges due to the imperceptible and inaccessible nature of these subtle human behaviours in everyday life. In this study, we collect a new micro-action dataset designated as Micro-action-52 (MA-52), and propose a benchmark named micro-action network (MANet) for the micro-action recognition (MAR) task. Uniquely, MA-52 provides the whole-body perspective, including gestures and upper- and lower-limb movements, attempting to reveal comprehensive micro-action cues. In detail, MA-52 contains 52 micro-action categories along with seven body part labels, and encompasses a full array of realistic and natural micro-actions, accounting for 205 participants and 22,422 video instances collated from psychological interviews. Based on the proposed dataset, we assess MANet and nine other prevalent action recognition methods. MANet incorporates squeeze-and-excitation (SE) and a temporal shift module (TSM) into the ResNet architecture for modeling the spatiotemporal characteristics of micro-actions. A joint-embedding loss is then designed for semantic matching between videos and action labels; the loss is used to better distinguish between visually similar yet distinct micro-action categories. The extended application in emotion recognition demonstrates one of the important values of our proposed dataset and method. In the future, further exploration of human behaviour, emotion, and psychological assessment will be conducted in depth. The dataset and source code are released at https://github.com/VUT-HFUT/Micro-Action.
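As a rough illustration of how a classification loss and a joint-embedding loss might be combined, the sketch below adds a cross-entropy term to a video-to-label matching term. The abstract does not give the form of the embedding loss, so the cosine-distance form, the `weight` parameter, and all function names here are assumptions made for illustration, not the authors' formulation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Classification loss: negative log-probability of the target class."""
    return -math.log(softmax(logits)[target])

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def embedding_loss(video_feat, label_embeds, target):
    """ASSUMED form: 1 - cos(video feature, embedding of the true label)."""
    return 1.0 - cosine_similarity(video_feat, label_embeds[target])

def joint_loss(logits, video_feat, label_embeds, target, weight=1.0):
    """Hypothetical combination: classification term + weight * embedding term."""
    return cross_entropy(logits, target) + weight * embedding_loss(
        video_feat, label_embeds, target
    )
```

Under this assumed form, a video feature that points in the same direction as its label embedding adds nothing to the classification loss, while an orthogonal one adds the full `weight`. Two classes with similar logits can therefore still be separated by the matching term.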

Paper Structure (22 sections, 7 equations, 11 figures, 7 tables)

This paper contains 22 sections, 7 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Data collection procedure and samples of our micro-action dataset. We conduct professional face-to-face psychological interviews to collect whole-body micro-actions, focusing on two levels of prediction, i.e., body part and micro-action category. Psychological interviews are conducted under the guidance of the Symptom Checklist 90 (SCL90) test (also called the self-report inventory) [derogatis2004scl] to elicit and collect natural spontaneous micro-actions.
  • Figure 2: (a) Details of body part (coarse-grained) and micro-action (fine-grained) labels. (b) Data distribution over the gender and age of respondents, data distribution over the micro-action categories, and the proportion of training/validation/test sets. (c) Detailed category distribution of video samples in the MA-52 dataset over the micro-action labels.
  • Figure 3: The architecture of the Micro-action Network (MANet). The core architecture of MANet integrates squeeze-and-excitation (SE) [hu2018squeeze] and the temporal shift module (TSM) [lin2019tsm] into the ResNet-50 framework. SE performs channel-wise feature enhancement on the spatial feature map, whereas TSM improves temporal modeling by shifting a portion of the channels between adjacent frame representations. To semantically align the action label with the video, a semantic embedding loss between the action label and the video feature, denoted as $\mathcal{L}_{emb}$, supervises their alignment through joint feature embedding. The model predicts the fine-grained action label of the video under the supervision of the classification loss $\mathcal{L}_{cls}$ and the embedding loss $\mathcal{L}_{emb}$.
  • Figure 4: Prediction results for six micro-action recognition examples on the MA-52 dataset. The line graphs show the predicted probability distributions over both coarse- and fine-grained labels. The MANet model demonstrates robust performance at both coarse- and fine-grained levels. For instance, example (e) shows that MANet correctly predicts the micro-action "hands touching fingers" together with the body part "upper limb." In contrast, the TSM model misclassifies the coarse-grained label as "lower limb" and the fine-grained label as "crossing legs" in this case. Distinguishing between highly similar micro-actions remains a pressing challenge.
  • Figure 5: t-SNE [van2008visualizing] results of coarse- and fine-grained features on the test set of the Micro-action-52 dataset. Each point indicates a video instance, and different colors indicate different micro-action categories. Compared with TSM [lin2019tsm], MANet shows a clear clustering effect on both coarse- and fine-grained micro-action categories.
  • ...and 6 more figures
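The two building blocks named in the Figure 3 caption can be sketched in a few lines of NumPy. This is a minimal sketch of the general SE and temporal-shift techniques, not the released MANet code. The `(T, C, H, W)` tensor layout, the function names, the `shift_div` parameter, and the weight shapes are assumptions for illustration, and bias terms are omitted for brevity.

```python
import numpy as np

def temporal_shift(x, shift_div=8):
    """Shift part of the channels along the time axis.

    x has the ASSUMED layout (T, C, H, W). The first C // shift_div channels
    take their values from the next frame, the following C // shift_div
    channels take theirs from the previous frame, and the rest are unchanged.
    Positions with no neighbouring frame are zero-filled.
    """
    fold = x.shape[1] // shift_div
    out = np.zeros_like(x)
    out[:-1, :fold] = x[1:, :fold]                   # pull from next frame
    out[1:, fold:2 * fold] = x[:-1, fold:2 * fold]   # pull from previous frame
    out[:, 2 * fold:] = x[:, 2 * fold:]              # untouched channels
    return out

def squeeze_excitation(x, w1, w2):
    """Channel-wise re-weighting of a (T, C, H, W) feature map.

    w1 has shape (C, C // r) and w2 has shape (C // r, C), where r is a
    reduction ratio (hypothetical parameters for this sketch).
    """
    squeezed = x.mean(axis=(2, 3))                   # global average pool -> (T, C)
    hidden = np.maximum(squeezed @ w1, 0.0)          # bottleneck + ReLU
    gates = 1.0 / (1.0 + np.exp(-(hidden @ w2)))     # sigmoid gates in (0, 1)
    return x * gates[:, :, None, None]               # rescale each channel
```

The shift itself adds no parameters or multiplications: it only moves existing features between neighbouring frames, so the 2D convolutions that follow can mix temporal information. SE adds two small matrices per block.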