MERBench: A Unified Evaluation Benchmark for Multimodal Emotion Recognition

Zheng Lian, Licai Sun, Yong Ren, Hao Gu, Haiyang Sun, Lan Chen, Bin Liu, Jianhua Tao

TL;DR

MERBench tackles the lack of fair comparison in multimodal emotion recognition by introducing a unified benchmark and the Chinese MER2023 dataset. It systematically evaluates unimodal and multimodal features, fusion strategies, cross-corpus transfer, and robustness to punctuation and noise, under a standardized pipeline. Key contributions include a rigorous baseline suite, extensive cross-dataset analyses, and recommendations favoring pre-training and language-aware encoders, plus a public codebase. The work provides a practical framework for reproducible research and highlights directions for robust, scalable emotion recognition in real-world, multilingual settings.

Abstract

Multimodal emotion recognition plays a crucial role in enhancing user experience in human-computer interaction. Over the past few decades, researchers have proposed a series of algorithms and achieved impressive progress. Although each method reports superior performance, the methods cannot be compared fairly because of inconsistencies in feature extractors, evaluation protocols, and experimental settings. These inconsistencies severely hinder the development of the field. We therefore build MERBench, a unified evaluation benchmark for multimodal emotion recognition. We aim to reveal the contribution of important techniques employed in previous work, such as feature selection, multimodal fusion, robustness analysis, fine-tuning, and pre-training. We hope this benchmark can provide clear and comprehensive guidance for follow-up researchers. Based on the evaluation results of MERBench, we further point out some promising research directions. Additionally, we introduce a new emotion dataset, MER2023, focusing on the Chinese language environment. This dataset can serve as a benchmark for research on multi-label learning, noise robustness, and semi-supervised learning. We encourage follow-up researchers to evaluate their algorithms under the same experimental setup as MERBench for fair comparison. Our code is available at: https://github.com/zeroQiaoba/MERTools.

Paper Structure (24 sections, 8 equations, 9 figures, 20 tables, 1 algorithm)

Figures (9)

  • Figure 1: Pipeline of data annotation.
  • Figure 2: Empirical PDFs and estimated Gaussian models on sample lengths for different subsets.
  • Figure 3: Distribution of discrete emotions for different subsets (neutral, anger, happiness, sadness, worry, surprise).
  • Figure 4: Empirical PDF on the valence for different discrete emotions. We calculate statistics using all valence-labeled samples.
  • Figure 5: Impact of language matching for acoustic encoders. This figure shows the relationship between the primary training language of the acoustic encoder and the input language.
  • ...and 4 more figures