EALD-MLLM: Emotion Analysis in Long-sequential and De-identity videos with Multi-modal Large Language Model

Deng Li, Xin Liu, Bohao Xing, Baiqiang Xia, Yuan Zong, Bihan Wen, Heikki Kälviäinen

TL;DR

The paper presents EALD, a de-identified long-sequence video dataset capturing post-match athlete interviews and associated non-facial body language (NFBL) cues, addressing privacy concerns and the lack of long-sequence data in emotion analysis. It introduces EALD-MLLM, a three-stage pipeline that de-identifies video and audio, uses Video-LLaMA to fuse de-identified multimodal inputs with NFBL textual cues, and employs ChatGPT to produce final emotion judgments, evaluated in a zero-shot setting. Results show that multimodal fusion and NFBL improve performance over single-modal baselines, with EALD-MLLM achieving competitive accuracy around 58.1% and higher F1 scores, highlighting NFBL as a meaningful identity-free signal for long-sequence emotion analysis. Limitations include challenges in NFBL detection and the two-stage MLLM approach, motivating future work toward end-to-end MLLM models and more robust NFBL detection for privacy-preserving emotion analytics with practical impact. The dataset and baseline provide a foundation for privacy-conscious, long-duration emotion understanding in real-world applications.

Abstract

Emotion AI is the ability of computers to understand human emotional states. Existing works have achieved promising progress, but two limitations remain to be solved: 1) Previous studies have been more focused on short sequential video emotion analysis while overlooking long sequential video. However, the emotions in short sequential videos only reflect instantaneous emotions, which may be deliberately guided or hidden. In contrast, long sequential videos can reveal authentic emotions; 2) Previous studies commonly utilize various signals such as facial, speech, and even sensitive biological signals (e.g., electrocardiogram). However, due to the increasing demand for privacy, developing Emotion AI without relying on sensitive signals is becoming important. To address the aforementioned limitations, in this paper, we construct a dataset for Emotion Analysis in Long-sequential and De-identity videos called EALD by collecting and processing the sequences of athletes' post-match interviews. In addition to providing annotations of the overall emotional state of each video, we also provide the Non-Facial Body Language (NFBL) annotations for each player. NFBL is an inner-driven emotional expression and can serve as an identity-free clue to understanding the emotional state. Moreover, we provide a simple but effective baseline for further research. More precisely, we evaluate the Multimodal Large Language Models (MLLMs) with de-identification signals (e.g., visual, speech, and NFBLs) to perform emotion analysis. Our experimental results demonstrate that: 1) MLLMs can achieve comparable, even better performance than the supervised single-modal models, even in a zero-shot scenario; 2) NFBL is an important cue in long sequential emotion analysis. EALD will be available on the open-source platform.

Paper Structure (13 sections, 6 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: Selected samples of non-facial body language with the masked face of the proposed dataset EALD.
  • Figure 2: Comparison of video durations in iMiGUE and EALD. The x-axis denotes the length of the videos, and the y-axis denotes the number of videos.
  • Figure 3: Comparison of the waveforms of the original and de-identified audio for one EALD sample (sample id 275).
  • Figure 4: The distribution of the various non-facial body language (NFBL) behaviors in EALD. The y-axis represents the frequency, and the x-axis represents the NFBL. The orange, green, and blue colors represent different types of NFBL, namely self-manipulations, manipulation (touching) of objects, and self-protection behaviors, respectively. Best viewed digitally with zoom.
  • Figure 5: The pipeline of the proposed EALD-MLLM.
  • ...and 3 more figures