Enhancing Micro Gesture Recognition for Emotion Understanding via Context-aware Visual-Text Contrastive Learning
Deng Li, Bohao Xing, Xin Liu
TL;DR
This work tackles emotion understanding from Micro Gestures (MG) without relying on biometric data, using MG clips paired with their textual MG labels. It introduces a visual-text contrastive learning framework that maps MG clips and label texts into a shared latent space, with an Adaptive prompting module that fuses visual context into the text prompts. The two modalities are aligned with a KL-divergence loss, using a projection dimension of $D=256$ and a temperature of $\tau=0.05$. The approach achieves state-of-the-art MG recognition on the iMiGUE and SMG datasets and shows that using textual MG predictions improves video-level emotion understanding by roughly 2–6%. The code is available at https://github.com/linuxsino/Visual-Text-MG. The results point to the practical value of cross-modal cues for affective computing and human-computer interaction.
Abstract
Psychological studies have shown that Micro Gestures (MG) are closely linked to human emotions. MG-based emotion understanding has attracted much attention because it allows emotions to be inferred from nonverbal body gestures without relying on identity information (e.g., facial and electrocardiogram data). Recognizing MG effectively is therefore essential for advanced emotion understanding. However, existing Micro Gesture Recognition (MGR) methods use only a single modality (e.g., RGB or skeleton) and overlook crucial textual information. In this letter, we propose a simple but effective visual-text contrastive learning solution that utilizes text information for MGR. In addition, instead of using handcrafted prompts for visual-text contrastive learning, we propose a novel module called Adaptive prompting to generate context-aware prompts. The experimental results show that the proposed method achieves state-of-the-art performance on two public datasets. Furthermore, in an empirical study that feeds the results of MGR into emotion understanding, we demonstrate that using the textual results of MGR improves performance by more than 6% compared to directly using video as input.
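To make the visual-text alignment more concrete, the following is a minimal NumPy sketch of a KL-divergence contrastive objective with projection dimension $D=256$ and temperature $\tau=0.05$. It is not the authors' implementation. The function names, the symmetric video-to-text and text-to-video form, and the label-derived target distribution are all assumptions made for illustration, and random arrays stand in for the outputs of the video and text encoders.

```python
import numpy as np

D = 256     # projection dimension quoted in the summary
TAU = 0.05  # temperature quoted in the summary


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis=-1):
    """Numerically stable log-softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def kl_contrastive_loss(video_emb, text_emb, labels, tau=TAU, eps=1e-8):
    """Hypothetical symmetric KL loss between label-derived targets and
    temperature-scaled similarity distributions (an assumed formulation)."""
    v = l2_normalize(video_emb)
    t = l2_normalize(text_emb)
    logits = v @ t.T / tau  # (B, B) similarity scores

    # Clips sharing an MG label count as positives for each other.
    same = (labels[:, None] == labels[None, :]).astype(np.float64)
    target = same / same.sum(axis=1, keepdims=True)

    def kl(p, log_q):
        # KL(p || q) per row, averaged over the batch; zero-target terms vanish.
        return np.sum(p * (np.log(p + eps) - log_q), axis=1).mean()

    loss_v2t = kl(target, log_softmax(logits, axis=1))
    loss_t2v = kl(target, log_softmax(logits.T, axis=1))
    return 0.5 * (loss_v2t + loss_t2v)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.normal(size=(8, D))  # stand-in for projected clip features
    text = rng.normal(size=(8, D))   # stand-in for projected prompt features
    mg_labels = np.array([0, 1, 2, 3, 0, 1, 2, 3])
    print(kl_contrastive_loss(video, text, mg_labels))
```

A KL-based target, unlike a one-hot cross-entropy target, can spread probability mass over several clips in a batch that share the same MG label, which is why the sketch builds the target from label equality rather than from the batch diagonal.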
