AffectGPT: Dataset and Framework for Explainable Multimodal Emotion Recognition

Zheng Lian, Haiyang Sun, Licai Sun, Jiangyan Yi, Bin Liu, Jianhua Tao

TL;DR

This work tackles data scarcity in Explainable Multimodal Emotion Recognition (EMER) by constructing EMER-Coarse, a large-scale, coarsely labeled dataset derived from MER2024-SEMI, and by introducing AffectGPT, a two-stage training framework. Stage1 trains on EMER-Coarse to learn a coarse mapping from audio–video–text inputs to emotion-related descriptions; Stage2 fine-tunes on the smaller, manually checked EMER-Fine to align outputs with high-quality labels. In the ablations, the Stage1-then-Stage2 pipeline consistently outperforms the baselines, which shows the value of large-scale coarse supervision for multimodal emotion understanding; the study also analyzes the choice of LLM and the initialization strategy. The approach supports scalable, explainable EMER research, and code and data are provided to facilitate future work on open-vocabulary emotion understanding and multimodal reasoning.

Abstract

Explainable Multimodal Emotion Recognition (EMER) is an emerging task that aims to achieve reliable and accurate emotion recognition. However, due to the high annotation cost, the existing dataset (denoted as EMER-Fine) is small, making it difficult to perform supervised training. To reduce the annotation cost and expand the dataset size, this paper reviews the previous dataset construction process. Then, we simplify the annotation pipeline, avoid manual checks, and replace the closed-source models with open-source models. Finally, we build EMER-Coarse, a coarsely-labeled dataset containing large-scale samples. Besides the dataset, we propose a two-stage training framework, AffectGPT. The first stage exploits EMER-Coarse to learn a coarse mapping between multimodal inputs and emotion-related descriptions; the second stage uses EMER-Fine to better align with manually-checked results. Experimental results demonstrate the effectiveness of our proposed method on the challenging EMER task. To facilitate further research, we will make the code and dataset available at: https://github.com/zeroQiaoba/AffectGPT.
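
The coarse-then-fine schedule described in the abstract can be illustrated with a toy example. The sketch below is not the AffectGPT implementation: it assumes a one-parameter regression model, synthetic data, and hypothetical helper names (`sgd_fit`, `make_samples`, `two_stage_train`). It only shows the general pattern of training first on a large, noisy set and then continuing from those weights on a small, clean set.

```python
import random


def sgd_fit(weight, samples, lr, epochs):
    """Fit y ~ weight * x by plain SGD on squared error; return the updated weight."""
    for _ in range(epochs):
        for x, y in samples:
            grad = 2.0 * (weight * x - y) * x  # derivative of (weight*x - y)^2
            weight -= lr * grad
    return weight


def make_samples(rng, n, true_weight, noise_std):
    """Draw n synthetic (x, y) pairs; noise_std stands in for label quality."""
    samples = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = true_weight * x + rng.gauss(0.0, noise_std)
        samples.append((x, y))
    return samples


def two_stage_train(coarse, fine, init_weight=0.0):
    """Stage 1 on the large noisy set, then stage 2 continues on the small clean set."""
    stage1_weight = sgd_fit(init_weight, coarse, lr=0.05, epochs=5)
    stage2_weight = sgd_fit(stage1_weight, fine, lr=0.01, epochs=20)
    return stage1_weight, stage2_weight


rng = random.Random(0)
# Assumption: sizes and noise levels are made up purely for illustration.
coarse_set = make_samples(rng, n=2000, true_weight=3.0, noise_std=1.0)
fine_set = make_samples(rng, n=50, true_weight=3.0, noise_std=0.05)

w_stage1, w_stage2 = two_stage_train(coarse_set, fine_set)
print(f"after stage 1: {w_stage1:.3f}, after stage 2: {w_stage2:.3f}")
```

In this toy setting, stage 1 moves the weight into the right neighborhood despite the label noise, and stage 2 refines it with a smaller learning rate on the cleaner samples.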

Paper Structure

This paper contains 22 sections, 3 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Ablation study on stage1. In these figures, we train models with different initialization strategies and report their results on different sets. Besides the original accuracy curve, we also add a smoothed curve. Meanwhile, we introduce two baselines without stage1 training.
  • Figure 2: Impact of different initialization strategies. We plot the curve of training loss and accuracy. As for accuracy, we evaluate the performance on the entire EMER-Fine.
  • Figure 3: Impact of different LLMs. We use the random initialization strategy and evaluate the performance on the entire EMER-Fine.
  • Figure 4: Ablation study on stage2. In these figures, we show the results on different subsets.
  • Figure 5: Necessity of two-stage training framework.
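
The Figure 1 caption mentions a smoothed accuracy curve but does not say which smoother was used. As a minimal sketch, assuming a simple trailing moving average (the function name `moving_average`, the window size, and the accuracy values below are all made up for illustration), such a curve could be computed like this:

```python
def moving_average(values, window):
    """Trailing moving average; early points average over the values seen so far."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]  # drop the value that left the window
        smoothed.append(running / min(i + 1, window))
    return smoothed


# Hypothetical per-checkpoint accuracies, not results from the paper.
accuracy = [0.40, 0.55, 0.45, 0.60, 0.50, 0.65]
smoothed = moving_average(accuracy, window=3)
print([round(s, 3) for s in smoothed])
```

A larger window gives a flatter curve at the cost of lagging behind real changes in accuracy.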