Customizing Visual Emotion Evaluation for MLLMs: An Open-vocabulary, Multifaceted, and Scalable Approach

Daiqing Wu, Dongbao Yang, Sicheng Zhao, Can Ma, Yu Zhou

TL;DR

The paper tackles inconsistent assessments of MLLMs' visual emotion understanding by introducing the Emotion Statement Judgment (ESJ) task and the INSETS annotation pipeline, which together enable open-vocabulary emotion tagging and multifaceted statement construction. Building on these, the authors present the MVEI benchmark, which evaluates four dimensions of affective cognition (sentiment polarity, emotion interpretation, scene context, and perception subjectivity) on a large, scalable dataset. Experiments show that although modern MLLMs have improved substantially, they still lag behind humans, especially in sentiment polarity and perception subjectivity; targeted adaptations yield gains but do not close the perception-subjectivity gap. Together, ESJ, INSETS, and MVEI provide a foundation for advancing emotional intelligence in MLLMs and for guiding future research in data and model design.

Abstract

Recently, Multimodal Large Language Models (MLLMs) have achieved exceptional performance across diverse tasks, continually surpassing previous expectations regarding their capabilities. Nevertheless, their proficiency in perceiving emotions from images remains debated, with studies yielding divergent results in zero-shot scenarios. We argue that this inconsistency stems partly from constraints in existing evaluation methods, including the oversight of plausible responses, limited emotional taxonomies, neglect of contextual factors, and labor-intensive annotations. To facilitate customized visual emotion evaluation for MLLMs, we propose an Emotion Statement Judgment task that overcomes these constraints. Complementing this task, we devise an automated pipeline that efficiently constructs emotion-centric statements with minimal human effort. Through systematically evaluating prevailing MLLMs, our study showcases their stronger performance in emotion interpretation and context-based emotion judgment, while revealing relative limitations in comprehending perception subjectivity. When compared to humans, even top-performing MLLMs like GPT4o demonstrate remarkable performance gaps, underscoring key areas for future improvement. By developing a fundamental evaluation framework and conducting a comprehensive MLLM assessment, we hope this work contributes to advancing emotional intelligence in MLLMs. Project page: https://github.com/wdqqdw/MVEI.
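
As a rough illustration of how an Emotion Statement Judgment evaluation could be scored, the sketch below computes per-dimension accuracy over binary statement judgments. It is not taken from the paper or its repository: the record layout, the dimension keys, and the function name `esj_accuracy` are all assumptions made for this example.

```python
from collections import defaultdict


def esj_accuracy(records):
    """Per-dimension accuracy for statement judgments.

    `records` is an iterable of (dimension, gold, predicted) tuples, where
    `gold` is True if the statement is correct for the image and `predicted`
    is True if the model judged it correct. This layout is hypothetical.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for dimension, gold, predicted in records:
        totals[dimension] += 1
        hits[dimension] += int(gold == predicted)
    # Only dimensions that appear in `records` are reported.
    return {dimension: hits[dimension] / totals[dimension] for dimension in totals}


# Toy judgments; the dimension keys mirror the four dimensions named in the
# summary, but the data itself is made up for illustration.
records = [
    ("sentiment_polarity", True, True),
    ("sentiment_polarity", False, True),
    ("scene_context", True, True),
    ("perception_subjectivity", False, False),
]
scores = esj_accuracy(records)
print(scores)
```

Reporting accuracy separately for each dimension, rather than as one pooled number, matches the way the summary contrasts stronger and weaker dimensions.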

Paper Structure

This paper contains 26 sections, 13 figures, and 9 tables.

Figures (13)

  • Figure 1: Comparison between current emotion evaluation approaches and the proposed ESJ.
  • Figure 2: Illustration of the open-vocabulary emotion tagging stage. We first extract all potential open-vocabulary emotions from the image dataset (a) and then attach these emotions to a well-established emotion model (b,c). Through this model (d), we identify and select open-vocabulary emotions consistently recognized by multiple MLLMs as the labels of each image (e).
  • Figure 3: Illustration of the emotional statement construction stage. It begins with prototype statement generation (a) for each emotion label, which is distributed across multiple MLLMs. Then, based on the assigned emotion labels and the corresponding prototype statements, correct and incorrect emotion-centric statements are constructed from four dimensions: sentiment polarity (b), emotion interpretation (c), scene context (d), and perception subjectivity (e).
  • Figure 4: A closer gaze at MVEI. Illustrations of a sample (a), the distribution of emotion labels (b), and the distribution of emotion-centric statements (c).
  • Figure 5: Samples that are deemed ambiguous during the human refinement process.
  • ...and 8 more figures
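
The Figure 2 caption mentions selecting open-vocabulary emotions that are consistently recognized by multiple MLLMs. One simple way to picture that selection step is a vote threshold over each model's proposed tags, sketched below. This is only an assumption about the general idea: the function name `consensus_labels`, the `min_votes` parameter, and the model names are hypothetical, and the sketch skips the step of attaching emotions to an established emotion model that the caption also describes.

```python
from collections import Counter


def consensus_labels(per_model_tags, min_votes=2):
    """Keep emotion tags proposed by at least `min_votes` models.

    `per_model_tags` maps a (hypothetical) model name to the list of
    open-vocabulary emotion tags that model proposed for one image.
    """
    votes = Counter()
    for tags in per_model_tags.values():
        # Normalize case and whitespace; count each tag once per model.
        votes.update({tag.strip().lower() for tag in tags})
    return sorted(tag for tag, count in votes.items() if count >= min_votes)


# Made-up tags for a single image from three placeholder models.
tags = {
    "model_a": ["Nostalgia", "serenity"],
    "model_b": ["serenity", "awe"],
    "model_c": ["nostalgia", "serenity", "boredom"],
}
labels = consensus_labels(tags)
print(labels)
```

Raising `min_votes` trades label coverage for agreement; the threshold used here is arbitrary.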