MME-Emotion: A Holistic Evaluation Benchmark for Emotional Intelligence in Multimodal Large Language Models

Fan Zhang, Zebang Cheng, Chong Deng, Haoxuan Li, Zheng Lian, Qian Chen, Huadai Liu, Wen Wang, Yi-Fan Zhang, Renrui Zhang, Ziyu Guo, Zhihong Zhu, Hao Wu, Haixin Wang, Yefeng Zheng, Xiaojiang Peng, Xian Wu, Kun Wang, Xiangang Li, Jieping Ye, Pheng-Ann Heng

TL;DR

The paper addresses the challenge of evaluating emotional intelligence in multimodal large language models by introducing MME-Emotion, a large-scale benchmark with 6,500 QA pairs across eight emotional tasks and 27 scenarios. It presents a holistic, automated evaluation framework using a multi-agent MLLM-as-judge system to measure emotion recognition (Rec-S) and reasoning (Rea-S), combined into a CoT-S score, with human verification validating the approach. Empirical results across 20 MLLMs reveal that overall emotional intelligence remains limited, with top CoT scores around 53–56% and recognition scores below 40%, and show that both generalist and emotion-specialist models can achieve competitive results via different strategies. The findings highlight the need for improved multimodal fusion, deeper reasoning, and richer visual/audio perception, establishing MME-Emotion as a foundation for future development of emotion-aware, omnimodal LLMs for real-world applications.

Abstract

Recent advances in multimodal large language models (MLLMs) have catalyzed transformative progress in affective computing, enabling models to exhibit emergent emotional intelligence. Despite substantial methodological progress, current emotional benchmarks remain limited, as two questions are still open: (a) how well MLLMs generalize across distinct scenarios, and (b) how well they reason about the triggering factors behind emotional states. To bridge these gaps, we present MME-Emotion, a systematic benchmark that assesses both the emotional understanding and the reasoning capabilities of MLLMs, offering scalable capacity, diverse settings, and unified protocols. As the largest emotional intelligence benchmark for MLLMs, MME-Emotion contains over 6,000 curated video clips with task-specific question-answering (QA) pairs, spanning broad scenarios to formulate eight emotional tasks. It further incorporates a holistic evaluation suite with hybrid metrics for emotion recognition and reasoning, analyzed through a multi-agent system framework. Through a rigorous evaluation of 20 advanced MLLMs, we uncover both their strengths and limitations, yielding several key insights: (1) Current MLLMs exhibit unsatisfactory emotional intelligence, with the best-performing model achieving only a 39.3% recognition score and a 56.0% Chain-of-Thought (CoT) score on our benchmark. (2) Generalist models (e.g., Gemini-2.5-Pro) derive emotional intelligence from generalized multimodal understanding capabilities, while specialist models (e.g., R1-Omni) can achieve comparable performance through domain-specific post-training adaptation. We hope that MME-Emotion can serve as a foundation for advancing MLLMs' emotional intelligence in the future.

Paper Structure

This paper contains 34 sections, 3 equations, 21 figures, 10 tables.

Figures (21)

  • Figure 1: Overview of MME-Emotion Statistics. Left: Task Types. MME-Emotion encompasses eight emotional tasks across 27 distinct scenario types, enabling fine-grained analysis of diverse video contexts. Right: Data Distributions. MME-Emotion features balanced distributions of question volume and video duration, facilitating comprehensive evaluation of temporal understanding.
  • Figure 2: Performance Comparison of Leading MLLMs on MME-Emotion. Our evaluation suite assesses MLLMs using three unified metrics (Rec-S, Rea-S, and CoT-S) across eight emotional tasks.
  • Figure 3: Illustration of Our Evaluation Strategy. We leverage a multi-agent system framework to assess the recognition and reasoning capabilities of MLLMs across different tasks with three unified metrics. To validate the effectiveness of our MLLM-as-judge strategy, we further compare the results of the judge agent on sampled data against results cross-evaluated by five human experts.
  • Figure 4: Task-level Performance Comparison (%) on MME-Emotion. We showcase fine-grained comparison results of 20 MLLMs using Rec-S, Rea-S, and CoT-S across 8 emotional tasks.
  • Figure 5: Relationships among Model Evaluation Metrics. Left Panel: Relationship between average steps and CoT scores. Center Panel: Relationship between reasoning and recognition scores. Right Panel: Pearson correlation quantifying inter-dependencies among five evaluation metrics.
  • ...and 16 more figures
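
The right panel of Figure 5 reports Pearson correlations among evaluation metrics. As a rough illustration of that computation, and not the authors' analysis code, the sketch below builds a pairwise correlation matrix from per-model scores. The function names, the choice of three metrics, and all numeric values are hypothetical placeholders; none of them are results from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances are left unnormalized; the 1/n factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def correlation_matrix(metrics):
    """Pairwise Pearson correlations for a dict of metric name -> scores."""
    names = list(metrics)
    return {a: {b: pearson(metrics[a], metrics[b]) for b in names} for a in names}


# Placeholder per-model scores (one entry per model); NOT data from the paper.
toy_scores = {
    "Rec-S": [10.0, 20.0, 30.0, 40.0],
    "Rea-S": [15.0, 25.0, 35.0, 45.0],
    "avg_steps": [4.0, 3.0, 2.0, 1.0],
}

matrix = correlation_matrix(toy_scores)
for name, row in matrix.items():
    print(name, {other: round(value, 3) for other, value in row.items()})
```

With real benchmark outputs, each list would hold one score per evaluated model, and the resulting matrix could be rendered as a heatmap like the one the caption describes.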