Understanding Emotional Body Expressions via Large Language Models
Haifeng Lu, Jiuyi Chen, Feng Liang, Mingkui Tan, Runhao Zeng, Xiping Hu
TL;DR
This work addresses two gaps in emotion recognition from 3D body movements: the recognition task itself and the absence of textual explanations for the predicted emotions. It introduces EAI-LLM, which converts skeleton data into tokens an LLM can consume via a Multi-Granularity Skeleton Tokenizer and a Unified Skeleton Token module. The model couples a skeleton encoder with a skeleton-aware LLM adapted using LoRA, and aligns the skeleton and language spaces through contrastive learning with CLIP-based text encoding, so that it can both classify emotions and generate descriptions. It is pre-trained on skeleton-language data and fine-tuned on prompt-based QA tasks, achieving competitive emotion recognition on Emilya, KDAE, and EGBM while producing rich emotion descriptions even with limited labeled data. By grounding LLM-generated explanations in skeletal movement patterns, the framework advances explainable, cross-dataset emotion understanding in HCI, with potential for broader multimodal and cross-domain applications.
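The skeleton-language alignment mentioned above can be illustrated with a symmetric contrastive (InfoNCE-style) loss. This is only a minimal sketch under stated assumptions: the function names, the temperature of 0.07, and the use of plain Python lists in place of a skeleton encoder's output and CLIP text embeddings are all hypothetical, and the loss actually used by EAI-LLM may differ.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]

def contrastive_loss(skeleton_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss; pair i of each list is the positive match.

    temperature=0.07 is an assumed value, not taken from the paper.
    """
    s = [l2_normalize(v) for v in skeleton_embs]
    t = [l2_normalize(v) for v in text_embs]
    n = len(s)
    # Cosine similarity between every skeleton/text pair, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(s[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Skeleton-to-text direction: each row should peak on its diagonal entry.
    loss_s2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-skeleton direction: same computation over the columns.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2s = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (loss_s2t + loss_t2s)
```

Minimizing a loss of this kind pulls each skeleton embedding toward the embedding of its paired description and pushes it away from the other descriptions in the batch, which is the general idea behind aligning the two spaces.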
Abstract
Emotion recognition based on body movements is vital in human-computer interaction. However, existing emotion recognition methods predominantly focus on enhancing classification accuracy, often neglecting the provision of textual explanations to justify their classifications. In this paper, we propose an Emotion-Action Interpreter powered by Large Language Model (EAI-LLM), which not only recognizes emotions but also generates textual explanations by treating 3D body movement data as unique input tokens within large language models (LLMs). Specifically, we propose a multi-granularity skeleton tokenizer designed for LLMs, which separately extracts spatio-temporal tokens and semantic tokens from the skeleton data. This approach allows LLMs to generate more nuanced classification descriptions while maintaining robust classification performance. Furthermore, we treat the skeleton sequence as a specific language and propose a unified skeleton token module. This module leverages the extensive background knowledge and language processing capabilities of LLMs to address the challenges of joint training on heterogeneous datasets, thereby significantly enhancing recognition accuracy on individual datasets. Experimental results demonstrate that our model achieves recognition accuracy comparable to existing methods. More importantly, with the support of background knowledge from LLMs, our model can generate detailed emotion descriptions based on classification results, even when trained on a limited amount of labeled skeleton data.
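To make the heterogeneous-dataset problem concrete, the sketch below maps skeletons with different joint layouts onto a shared set of body-part positions, so that sequences from different datasets have the same per-frame shape before tokenization. It is a simplified stand-in, not the paper's unified skeleton token module: the dataset names, joint indices, body-part grouping, and averaging rule are all hypothetical.

```python
# Shared body-part vocabulary (an assumption made for this illustration).
BODY_PARTS = ["torso", "left_arm", "right_arm", "left_leg", "right_leg"]

# Hypothetical joint layouts for two datasets with different joint counts.
LAYOUTS = {
    "dataset_a": {  # 11 joints per frame
        "torso": [0, 1, 2],
        "left_arm": [3, 4],
        "right_arm": [5, 6],
        "left_leg": [7, 8],
        "right_leg": [9, 10],
    },
    "dataset_b": {  # 14 joints per frame
        "torso": [0, 1],
        "left_arm": [2, 3, 4],
        "right_arm": [5, 6, 7],
        "left_leg": [8, 9, 10],
        "right_leg": [11, 12, 13],
    },
}

def mean_point(points):
    """Average a list of (x, y, z) joint positions."""
    n = len(points)
    return tuple(sum(p[axis] for p in points) / n for axis in range(3))

def unify_frame(frame, layout):
    """Reduce one frame of joints to one 3D point per shared body part."""
    return [mean_point([frame[j] for j in layout[part]]) for part in BODY_PARTS]

def unify_sequence(sequence, dataset):
    """Convert a skeleton sequence to shape (frames, len(BODY_PARTS), 3)."""
    layout = LAYOUTS[dataset]
    return [unify_frame(frame, layout) for frame in sequence]
```

After a step like this, an 11-joint sequence and a 14-joint sequence both yield five 3D points per frame, so a single downstream tokenizer could be trained on both jointly.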
