AffectGPT: A New Dataset, Model, and Benchmark for Emotion Understanding with Multimodal Large Language Models

Zheng Lian, Haoyu Chen, Lan Chen, Haiyang Sun, Licai Sun, Yong Ren, Zebang Cheng, Bin Liu, Rui Liu, Xiaojiang Peng, Jiangyan Yi, Jianhua Tao

TL;DR

This work addresses the scarcity of large-scale descriptive emotion data and dedicated benchmarks for multimodal large language models (MLLMs). It introduces MER-Caption, a 115K-sample descriptive emotion dataset (plus 31K fine-labeled cases) built with a model-led, human-assisted annotation pipeline. It also presents AffectGPT, a model whose pre-fusion mechanism moves cross-modal interaction outside the LLM, using Q-Former or attention modules to integrate modalities before LLM decoding, and MER-UniBench, a unified benchmark with three MER tasks and metrics tailored to free-form outputs. Empirical results show substantial improvements over existing MLLMs, with gains attributed to both the quality of the MER-Caption data and the pre-fusion strategy; code and data are released for community use. Overall, the paper advances descriptive emotion understanding in MLLMs and provides a scalable annotation pipeline and evaluation framework to support future research in emotion AI.

Abstract

The emergence of multimodal large language models (MLLMs) advances multimodal emotion recognition (MER) to the next level, from naive discriminative tasks to complex emotion understanding with advanced video understanding abilities and natural language description. However, the current community suffers from a lack of large-scale datasets with intensive, descriptive emotion annotations, as well as a multimodal-centric framework to maximize the potential of MLLMs for emotion understanding. To address this, we establish a new benchmark for MLLM-based emotion understanding with a novel dataset (MER-Caption) and a new model (AffectGPT). Utilizing our model-based crowd-sourcing data collection strategy, we construct the largest descriptive emotion dataset to date (by far), featuring over 2K fine-grained emotion categories across 115K samples. We also introduce the AffectGPT model, designed with pre-fusion operations to enhance multimodal integration. Finally, we present MER-UniBench, a unified benchmark with evaluation metrics tailored for typical MER tasks and the free-form, natural language output style of MLLMs. Extensive experimental results show AffectGPT's robust performance across various MER tasks. We have released both the code and the dataset to advance research and development in emotion understanding: https://github.com/zeroQiaoba/AffectGPT.

Paper Structure

This paper contains 47 sections, 13 equations, 10 figures, 13 tables.

Figures (10)

  • Figure 1: Emotion complexity analysis. Human emotions are often diverse and coexist simultaneously. Such complex emotional states are difficult to describe using discriminative frameworks. However, MLLMs can generate emotional descriptions, offering new possibilities for complex emotion modeling. Since the original videos contain real people, to address copyright concerns, we first use https://www.domoai.app/zh-Hant/home to remove personal information and then proceed with visualization.
  • Figure 2: Dataset construction pipeline. To create a large-scale dataset with guaranteed label quality, we propose a model-led, human-assisted annotation strategy. In this approach, we leverage human priors to guide description generation and sample filtering, ultimately achieving automatic annotation for unlabeled data.
  • Figure 3: Model comparison. ALLM and VLLM primarily use modality-specific encoders and align them with the LLM through projection layers. AV-LLM mainly facilitates cross-modal interaction within the language model. In AffectGPT, we move the cross-modal interaction outside the language model and use a pre-fusion operation to enhance multimodal integration. In these figures, $\mathbf{P}$ can be determined based on the requirement of whether to include $\mathbf{X_t}$.
  • Figure 4: Ablation studies on LLMs, audio encoders, and video encoders.
  • Figure 5: Visualization of MLLM outputs.
  • ...and 5 more figures
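
The pre-fusion idea in the Figure 3 caption (cross-modal interaction performed outside the language model, before decoding) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the mean pooling over time, and the single scoring vector `score_weights` are all assumptions made for illustration, standing in for the learned attention module the summary mentions. It only shows the general shape of the attention variant, in which audio and video features are weighted per modality and combined into one fused token that would be passed to the LLM alongside the unimodal tokens.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool(frames):
    """Average a sequence of equal-length feature vectors over time."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def attention_prefusion(audio_frames, video_frames, score_weights):
    """Fuse audio and video features with modality-level attention.

    score_weights is a hypothetical stand-in for a learned scoring layer:
    each pooled modality vector gets one scalar score (a dot product),
    and a softmax over the two scores gives the fusion weights.
    """
    audio = mean_pool(audio_frames)
    video = mean_pool(video_frames)
    scores = [
        sum(w * x for w, x in zip(score_weights, feat))
        for feat in (audio, video)
    ]
    alpha = softmax(scores)
    fused = [alpha[0] * a + alpha[1] * v for a, v in zip(audio, video)]
    return fused, alpha


def build_llm_prefix(audio_frames, video_frames, score_weights):
    """Return the tokens that would precede the text prompt (assumed layout):
    pooled audio token, pooled video token, and the pre-fused token."""
    fused, _ = attention_prefusion(audio_frames, video_frames, score_weights)
    return [mean_pool(audio_frames), mean_pool(video_frames), fused]
```

In this sketch the fusion happens entirely before any language-model call, which is the design point the caption contrasts with AV-LLM approaches that leave cross-modal interaction to the LLM itself.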