Enhancing Micro Gesture Recognition for Emotion Understanding via Context-aware Visual-Text Contrastive Learning

Deng Li; Bohao Xing; Xin Liu

TL;DR

This work tackles emotion understanding from Micro Gestures (MG) without relying on biometric data, using MG clips paired with textual MG labels. It introduces a visual-text contrastive learning framework that maps MG clips and label texts into a shared latent space. An Adaptive prompting module fuses visual context into the text prompts, and the two modalities are aligned with a KL divergence loss, using a projection dimension of $D=256$ and a temperature of $\tau=0.05$. The approach achieves state-of-the-art MG recognition on the iMiGUE and SMG datasets and shows that using textual MG predictions improves video-level emotion understanding by roughly 2–6%. Code is available at https://github.com/linuxsino/Visual-Text-MG. The results underscore the practical value of cross-modal cues for affective computing and human–computer interaction.

Abstract

Psychological studies have shown that Micro Gestures (MG) are closely linked to human emotions. MG-based emotion understanding has attracted much attention because it allows for emotion understanding through nonverbal body gestures without relying on identity information (e.g., facial and electrocardiogram data). Therefore, it is essential to recognize MG effectively for advanced emotion understanding. However, existing Micro Gesture Recognition (MGR) methods utilize only a single modality (e.g., RGB or skeleton) while overlooking crucial textual information. In this letter, we propose a simple but effective visual-text contrastive learning solution that utilizes text information for MGR. In addition, instead of using handcrafted prompts for visual-text contrastive learning, we propose a novel module called Adaptive prompting to generate context-aware prompts. The experimental results show that the proposed method achieves state-of-the-art performance on two public datasets. Furthermore, based on an empirical study utilizing the results of MGR for emotion understanding, we demonstrate that using the textual results of MGR significantly improves performance by 6%+ compared to directly using video as input.

Paper Structure (9 sections, 5 equations, 4 figures, 5 tables)

Introduction
Proposed method
    Micro gesture recognition
    Emotion understanding
Experimental result
    Experimental setting
    Ablation study
    Comparative study
Conclusion

Figures (4)

Figure 1: Some micro gestures (e.g., "touching jaw" and "touching or scratching neck") exhibit similarities in visual feature space, as depicted in (c), but they differ in text feature space, as illustrated in (d). Is it enough to rely solely on visual cues to recognize micro gestures? How can we leverage text information for micro gesture recognition? Best viewed digitally in color and zoomed-in.
Figure 2: Comparison of different emotion understanding solutions: The proposed method initially recognizes micro gestures through visual-text contrastive learning. It subsequently utilizes the textual prediction results from micro gesture recognition for emotion understanding.
Figure 3: Structure of the clip-level micro gesture recognition. We employ a pre-trained video encoder to encode micro gesture clips as video representation $\overline{v}_i$ and a pre-trained text encoder to encode pre-defined micro gesture labels as text representation $\overline{t}_i$. The proposed Adaptive prompting integrates the contextual information from $\overline{v}_i$ with $\overline{t}_i$ and then generates $\hat{t}_i$. We adopt Kullback-Leibler divergence loss KL($\cdot$) to optimize the similarity score.
Figure 4: Comparison of the resulting confusion matrix for MGR. 'W/o text info.' denotes the baseline model, and 'w/ text info.' denotes the proposed visual-text model. The X-axis represents the model's predicted class, while the Y-axis represents the actual class.
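
The clip-level objective described in the Figure 3 caption (cosine similarity between video and text representations, optimized with a Kullback-Leibler divergence loss) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the symmetric video-to-text and text-to-video averaging, and the uniform-over-same-class target distribution are all assumptions made for illustration. Only the temperature $\tau=0.05$ comes from the summary above, and the sketch assumes the features have already been projected into the shared space (the summary reports $D=256$; the toy vectors here are 2-dimensional).

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def similarity_matrix(video_feats, text_feats, tau=0.05):
    """Cosine similarity of every clip against every text, divided by tau."""
    v = [l2_normalize(x) for x in video_feats]
    t = [l2_normalize(x) for x in text_feats]
    return [[sum(a * b for a, b in zip(vi, tj)) / tau for tj in t] for vi in v]


def kl_div(p, q, eps=1e-8):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def kl_contrastive_loss(video_feats, text_feats, labels, tau=0.05):
    """Hypothetical symmetric KL contrastive loss over one batch.

    video_feats[i] and text_feats[i] belong to the same sample, and
    labels[i] is its class index. The target for row i is uniform over
    all batch items that share class labels[i] (an assumed choice).
    """
    n = len(labels)
    sims = similarity_matrix(video_feats, text_feats, tau)
    sims_t = [list(col) for col in zip(*sims)]  # text-to-video direction

    targets = []
    for i in range(n):
        same = [1.0 if labels[j] == labels[i] else 0.0 for j in range(n)]
        total = sum(same)
        targets.append([s / total for s in same])

    v2t = sum(kl_div(targets[i], softmax(sims[i])) for i in range(n)) / n
    t2v = sum(kl_div(targets[i], softmax(sims_t[i])) for i in range(n)) / n
    return 0.5 * (v2t + t2v)


# Toy usage: two clips whose embeddings match their own label texts.
clips = [[1.0, 0.0], [0.0, 1.0]]
texts = [[1.0, 0.0], [0.0, 1.0]]
print(kl_contrastive_loss(clips, texts, labels=[0, 1]))
```

With a small temperature such as 0.05, the softmax over similarities becomes sharp, so well-aligned clip and text pairs yield a loss close to zero, while mismatched pairs are penalized heavily.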
