Explainable Multimodal Emotion Recognition

Zheng Lian, Haiyang Sun, Licai Sun, Hao Gu, Zhuofan Wen, Siyuan Zhang, Shun Chen, Mingyu Xu, Ke Xu, Kang Chen, Lan Chen, Shan Liang, Ya Li, Jiangyan Yi, Bin Liu, Jianhua Tao

TL;DR

This work defines Explainable Multimodal Emotion Recognition (EMER), a task that extends traditional emotion recognition by requiring explanations and open-vocabulary labels grounded in multimodal evidence. It introduces a new EMER dataset derived from MER2023, along with a structured data-annotation pipeline (pre-labeling, two-round checks, disambiguation) to extract rich visual, acoustic, lexical, and open-emotion cues. Through large language models and careful evaluation, the authors show that EMER enables richer, more reliable emotion labels and can benchmark multimodal LLMs, though current models still struggle to reach ground-truth performance. The study also analyzes language and modality effects, subtitle integration strategies, and the relationship between emotion-focused and text-based evaluation metrics, proposing EMER as a general format for open-emotion understanding tasks with practical implications for human-computer interaction.

Abstract

Multimodal emotion recognition is an important research topic in artificial intelligence, whose main goal is to integrate multimodal clues to identify human emotional states. Current works generally assume accurate labels for benchmark datasets and focus on developing more effective architectures. However, emotion annotation relies on subjective judgment. To obtain more reliable labels, existing datasets usually restrict the label space to a few basic categories, then hire many annotators and use majority voting to select the most likely label. However, this process may cause some correct but non-candidate or non-majority labels to be ignored. To ensure reliability without ignoring subtle emotions, we propose a new task called "Explainable Multimodal Emotion Recognition (EMER)". Unlike traditional emotion recognition, EMER goes a step further by providing explanations for its predictions. Through this task, we can extract relatively reliable labels, since each label is backed by explicit evidence. Meanwhile, we use large language models (LLMs) to disambiguate unimodal clues and generate more complete multimodal explanations. From these explanations, we can extract richer emotions in an open-vocabulary manner. This paper presents our initial attempt at this task, including introducing a new dataset, establishing baselines, and defining evaluation metrics. In addition, EMER can serve as a benchmark task to evaluate the audio-video-text understanding performance of multimodal LLMs.
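
The label loss described in the abstract can be illustrated with a minimal sketch. It is not the authors' annotation pipeline: the candidate label set, the function name `majority_vote`, and the example votes are all hypothetical, chosen only to show how a closed label space plus majority voting drops labels that may still be correct.

```python
from collections import Counter

# Hypothetical closed label space (an assumption for illustration only;
# the candidate set used by any real dataset may differ).
CANDIDATE_LABELS = {"happy", "sad", "angry", "neutral", "surprised", "worried"}


def majority_vote(annotations, candidates=CANDIDATE_LABELS):
    """Return (winning label, sorted list of labels the vote discards)."""
    # Labels outside the candidate set never enter the vote.
    in_space = [label for label in annotations if label in candidates]
    counts = Counter(in_space)
    winner = counts.most_common(1)[0][0] if counts else None
    # Everything except the winner is lost, whether it was a minority
    # candidate ("worried") or a non-candidate label ("anxious").
    dropped = sorted(set(annotations) - {winner})
    return winner, dropped


# Hypothetical votes from five annotators for one clip.
votes = ["angry", "angry", "angry", "worried", "anxious"]
winner, dropped = majority_vote(votes)
print(winner, dropped)
```

In this toy case the clip ends up with a single label, while the minority and out-of-vocabulary labels are discarded. An EMER-style open-vocabulary annotation would instead aim to keep every label that has supporting evidence in the explanation.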

Paper Structure

This paper contains 33 sections, 3 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: One example ("sample_00000669") to illustrate the differences between the one-hot label, EMER description, and EMER-based open vocabulary labels.
  • Figure 2: Pipeline of generating multimodal descriptions EMER(Multi).
  • Figure 3: Language influence analysis.
  • Figure 4: Pipeline for generating unimodal and multimodal descriptions.
  • Figure 5: Performance of different subtitle integration strategies on varying MLLMs.
  • ...and 3 more figures