OV-MER: Towards Open-Vocabulary Multimodal Emotion Recognition

Zheng Lian, Haiyang Sun, Licai Sun, Haoyu Chen, Lan Chen, Hao Gu, Zhuofan Wen, Shun Chen, Siyuan Zhang, Hailiang Yao, Bin Liu, Rui Liu, Shan Liang, Ya Li, Jiangyan Yi, Jianhua Tao

TL;DR

This work introduces Open-Vocabulary Multimodal Emotion Recognition (OV-MER), arguing that fixed emotion taxonomies fail to capture the full spectrum of human affect. It proposes OV-MERD, a dataset built via a human–LLM collaboration pipeline that expands annotations beyond predefined labels, and defines new evaluation metrics based on label groupings obtained from GPT-based and emotion-wheel-based approaches. The authors demonstrate that generations based on CLUE-Multi, CLUE-Video, and CLUE-Audio yield strong upper-bound performance, while open-vocabulary sentiment remains challenging for current Multimodal LLMs (MLLMs). They further show that emotion-wheel-based groupings can closely track GPT-based metrics at lower cost, and that OV-MERD labels align well with human perception. Overall, OV-MERD advances MER toward real-world applicability by enabling richer, more nuanced emotional representations and providing a foundation for future open-vocabulary emotion AI research.

Abstract

Multimodal Emotion Recognition (MER) is a critical research area that seeks to decode human emotions from diverse data modalities. However, existing machine learning methods predominantly rely on predefined emotion taxonomies, which fail to capture the inherent complexity, subtlety, and multi-appraisal nature of human emotional experiences, as demonstrated by studies in psychology and cognitive science. To overcome this limitation, we advocate for introducing the concept of open vocabulary into MER. This paradigm shift aims to enable models to predict emotions beyond a fixed label space, accommodating a flexible set of categories to better reflect the nuanced spectrum of human emotions. To achieve this, we propose a novel paradigm: Open-Vocabulary MER (OV-MER), which enables emotion prediction without being confined to predefined spaces. However, constructing a dataset that encompasses the full range of emotions for OV-MER is practically infeasible; hence, we present a comprehensive solution including a newly curated database, novel evaluation metrics, and a preliminary benchmark. By advancing MER from basic emotions to more nuanced and diverse emotional states, we hope this work can inspire the next generation of MER, enhancing its generalizability and applicability in real-world scenarios. Code and dataset are available at: https://github.com/zeroQiaoba/AffectGPT.

Paper Structure (68 sections, 11 equations, 20 figures, 16 tables)

Figures (20)

  • Figure 1: Comparison. (a) Task Comparison: We compare the differences among three tasks (one-hot MER, multi-label MER, and OV-MER) across three aspects (label space, label number, and annotation manner). An in-depth comparison is provided in the Appendix; (b) Label Comparison: We provide an example to visualize the one-hot and OV labels. More examples are provided in the Appendix. Since the original video contains real people, we use https://www.domoai.app/zh-Hant/home to remove personal information to address copyright concerns. In this paper, we use emotion-related descriptions as a bridge to extract OV labels. We observe that OV labels offer a more insightful understanding of the emotional state.
  • Figure 2: Dataset construction. (a) CLUE-Multi Generation: For audio and video, we use audio LLM (ALLM) and video LLM (VLLM) to extract initial clues, followed by two rounds of manual checks to eliminate errors and duplicates while adding missing content. Each round involves multiple annotators, with no overlap between annotators in the two rounds. Finally, we merge the checked clues with text to generate CLUE-Multi. (b) Ground-truth OV Label Extraction: There are certain differences in the labels extracted from different languages. To eliminate language influence and achieve consensus labels, we merge these labels and conduct manual checks. These checked labels are regarded as the ground truth.
  • Figure 3: Baselines. (a) Preliminary: We begin by defining some preliminary symbols. (b) CLUE Generation: CLUE-Video and CLUE-Audio use manually-checked clues; CLUE-Text relies solely on text; CLUE-MLLM does not involve manual checks and directly uses the outputs from ALLM or VLLM. (c) Metric Calculation: We rely on CLUE to predict emotion labels. Due to variations in labels extracted from different languages, we report results across different languages.
  • Figure 4: Human-only (H) vs. Human-LLM (H+L) strategy.
  • Figure 5: Performance comparison of different strategies for generating CLUE-MLLM.
  • ...and 15 more figures
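
The grouping-based evaluation mentioned in the TL;DR and in Figure 3(c) can be illustrated with a small sketch. This section does not give the exact metric definitions, so the form below is an assumption: predicted and ground-truth labels are each mapped to coarser groups, and set-level precision, recall, and their harmonic mean are computed over the grouped sets. The mapping `TOY_GROUPS` and the function names are hypothetical, standing in for a GPT-based or emotion-wheel grouping.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical label-to-group map (assumption): a toy stand-in for an
# emotion-wheel or GPT-based grouping of open-vocabulary labels.
TOY_GROUPS: Dict[str, str] = {
    "happy": "joy",
    "joyful": "joy",
    "cheerful": "joy",
    "angry": "anger",
    "furious": "anger",
    "worried": "fear",
    "anxious": "fear",
}


def to_groups(labels: Iterable[str], groups: Dict[str, str]) -> Set[str]:
    """Normalize labels and map each to its group; unknown labels stay as-is."""
    normalized = (label.strip().lower() for label in labels)
    return {groups.get(label, label) for label in normalized}


def grouped_set_scores(
    predicted: Iterable[str],
    gold: Iterable[str],
    groups: Dict[str, str],
) -> Tuple[float, float, float]:
    """Return (precision, recall, f_score) over grouped label sets.

    Assumed form: precision is the share of predicted groups found in the
    gold groups, recall is the share of gold groups that were predicted.
    """
    pred_groups = to_groups(predicted, groups)
    gold_groups = to_groups(gold, groups)
    overlap = len(pred_groups & gold_groups)
    precision = overlap / len(pred_groups) if pred_groups else 0.0
    recall = overlap / len(gold_groups) if gold_groups else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


if __name__ == "__main__":
    p, r, f = grouped_set_scores(
        ["joyful", "anxious", "surprised"], ["happy", "worried"], TOY_GROUPS
    )
    print(f"precision={p:.3f} recall={r:.3f} f={f:.3f}")
```

In this toy case, "joyful" and "happy" fall into the same group, so a prediction that uses a synonym of the ground-truth label still counts as a match, which is the motivation for grouping before scoring. Swapping `TOY_GROUPS` for a different mapping changes only the grouping step, not the scoring.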