Understanding Emotional Body Expressions via Large Language Models

Haifeng Lu, Jiuyi Chen, Feng Liang, Mingkui Tan, Runhao Zeng, Xiping Hu

TL;DR

This work tackles emotion recognition from 3D body movements and the lack of textual explanations by introducing EAI-LLM, which converts skeleton data into LLM-friendly tokens via a Multi-Granularity Skeleton Tokenizer and a Unified Skeleton Token module. It couples skeleton encoding with a skeleton-aware LLM using LoRA, and aligns skeleton and language spaces through contrastive learning and CLIP-based text encoding to enable both accurate classification and descriptive generation. The approach is pre-trained on skeleton-language data and fine-tuned on prompt-based QA tasks, achieving competitive emotion recognition across Emilya, KDAE, and EGBM while enabling rich emotion descriptions, even with limited labeled data. This framework advances explainable, cross-dataset emotion understanding in HCI by leveraging LLMs to produce text explanations grounded in skeletal movement patterns, with potential for broader multimodal and cross-domain applications.
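
The skeleton-language alignment mentioned above can be illustrated with a small sketch. This summary does not give the exact loss used by EAI-LLM, so the code below assumes a CLIP-style symmetric contrastive (InfoNCE) objective over paired skeleton and text embeddings. The function names, the temperature value of 0.07, and the toy embeddings are all hypothetical choices made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]


def symmetric_contrastive_loss(skeleton_emb, text_emb, temperature=0.07):
    """Assumed CLIP-style loss: row i of each input is a matched pair.

    The temperature default is an illustrative assumption, not a value
    taken from the paper.
    """
    s = [l2_normalize(v) for v in skeleton_emb]
    t = [l2_normalize(v) for v in text_emb]
    n = len(s)
    # Cosine similarity between every skeleton and every text embedding.
    logits = [
        [sum(a * b for a, b in zip(s[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    logits_t = [list(col) for col in zip(*logits)]
    # Cross-entropy with the diagonal (matched pair) as the target class.
    loss_s2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    loss_t2s = -sum(log_softmax(logits_t[i])[i] for i in range(n)) / n
    return 0.5 * (loss_s2t + loss_t2s)


# Toy embeddings: each text vector is close to its paired skeleton vector.
skeleton = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
text_matched = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
text_shuffled = text_matched[1:] + text_matched[:1]

print(symmetric_contrastive_loss(skeleton, text_matched))
print(symmetric_contrastive_loss(skeleton, text_shuffled))
```

Under these assumptions, correctly paired embeddings give a loss near zero, while shuffling the text side produces a much larger loss. That gap is the signal a contrastive objective uses to pull matching skeleton and text representations together.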

Abstract

Emotion recognition based on body movements is vital in human-computer interaction. However, existing emotion recognition methods predominantly focus on enhancing classification accuracy, often neglecting to provide textual explanations that justify their classifications. In this paper, we propose an Emotion-Action Interpreter powered by a Large Language Model (EAI-LLM), which not only recognizes emotions but also generates textual explanations by treating 3D body movement data as unique input tokens within large language models (LLMs). Specifically, we propose a multi-granularity skeleton tokenizer designed for LLMs, which separately extracts spatio-temporal tokens and semantic tokens from the skeleton data. This approach allows LLMs to generate more nuanced classification descriptions while maintaining robust classification performance. Furthermore, we treat the skeleton sequence as a specific language and propose a unified skeleton token module. This module leverages the extensive background knowledge and language processing capabilities of LLMs to address the challenges of joint training on heterogeneous datasets, thereby significantly enhancing recognition accuracy on individual datasets. Experimental results demonstrate that our model achieves recognition accuracy comparable to existing methods. More importantly, with the support of background knowledge from LLMs, our model can generate detailed emotion descriptions based on classification results, even when trained on a limited amount of labeled skeleton data.

Paper Structure

This paper contains 30 sections, 5 equations, 4 figures, 10 tables.

Figures (4)

  • Figure 1: This work presents a novel approach for 3D full-body skeleton-based emotion recognition using fine-tuned LLMs, termed EAI-LLM. Unlike previous methods, EAI-LLM not only identifies emotions but also generates textual explanations by treating 3D body movement data as unique input tokens within the LLMs.
  • Figure 2: Diagram of skeleton-language alignment.
  • Figure 3: Confusion matrices for Emilya dataset using different training strategies.
  • Figure 4: Examples of the emotion description capabilities of EAI-LLM. Underlined text indicates that the description is unrelated to the input sequences.