EALD-MLLM: Emotion Analysis in Long-sequential and De-identity videos with Multi-modal Large Language Model
Deng Li, Xin Liu, Bohao Xing, Baiqiang Xia, Yuan Zong, Bihan Wen, Heikki Kälviäinen
TL;DR
The paper presents EALD, a de-identified long-sequence video dataset capturing post-match athlete interviews and associated non-facial body language (NFBL) cues, addressing privacy concerns and the lack of long-sequence data in emotion analysis. It introduces EALD-MLLM, a three-stage pipeline that de-identifies video and audio, uses Video-LLaMA to fuse de-identified multimodal inputs with NFBL textual cues, and employs ChatGPT to produce final emotion judgments, evaluated in a zero-shot setting. Results show that multimodal fusion and NFBL improve performance over single-modal baselines, with EALD-MLLM achieving competitive accuracy around $58.1\%$ and higher F1 scores, highlighting NFBL as a meaningful identity-free signal for long-sequence emotion analysis. Limitations include challenges in NFBL detection and the two-stage MLLM approach, motivating future work toward end-to-end MLLM models and more robust NFBL detection for privacy-preserving emotion analytics with practical impact. The dataset and baseline provide a foundation for privacy-conscious, long-duration emotion understanding in real-world applications.
Abstract
Emotion AI is the ability of computers to understand human emotional states. Existing work has made promising progress, but two limitations remain: 1) Previous studies have focused on emotion analysis in short sequential videos while overlooking long sequential videos. However, short sequential videos reflect only instantaneous emotions, which may be deliberately guided or hidden. In contrast, long sequential videos can reveal authentic emotions; 2) Previous studies commonly use various signals such as facial, speech, and even sensitive biological signals (e.g., electrocardiogram). However, with the increasing demand for privacy, developing Emotion AI without relying on sensitive signals is becoming important. To address these limitations, we construct a dataset for Emotion Analysis in Long-sequential and De-identity videos, called EALD, by collecting and processing athletes' post-match interview sequences. In addition to annotating the overall emotional state of each video, we provide Non-Facial Body Language (NFBL) annotations for each player. NFBL is an inner-driven emotional expression and can serve as an identity-free clue for understanding the emotional state. Moreover, we provide a simple but effective baseline for further research. More precisely, we evaluate Multimodal Large Language Models (MLLMs) with de-identified signals (e.g., visual, speech, and NFBLs) to perform emotion analysis. Our experimental results demonstrate that: 1) MLLMs can achieve comparable or even better performance than supervised single-modal models, even in a zero-shot scenario; 2) NFBL is an important cue in long sequential emotion analysis. EALD will be available on the open-source platform.
