Why We Feel: Breaking Boundaries in Emotional Reasoning with Multimodal Large Language Models

Yuxiang Lin, Jingdong Sun, Zhi-Qi Cheng, Jue Wang, Haomin Liang, Zebang Cheng, Yifei Dong, Jun-Yan He, Xiaojiang Peng, Xian-Sheng Hua

TL;DR

This work presents EIBench, a large-scale benchmark for Emotion Interpretation (EI) encompassing 1,615 basic EI samples and 50 complex EI samples featuring multifaceted emotions, and proposes a Coarse-to-Fine Self-Ask (CFSA) annotation pipeline, which guides Vision-Language Models (VLLMs) through iterative question-answer rounds to yield high-quality labels at scale.

Abstract

Most existing emotion analysis emphasizes which emotion arises (e.g., happy, sad, angry) but neglects the deeper why. We propose Emotion Interpretation (EI), focusing on the causal factors, whether explicit (e.g., observable objects, interpersonal interactions) or implicit (e.g., cultural context, off-screen events), that drive emotional responses. Unlike traditional emotion recognition, EI tasks require reasoning about triggers instead of mere labeling. To facilitate EI research, we present EIBench, a large-scale benchmark encompassing 1,615 basic EI samples and 50 complex EI samples featuring multifaceted emotions. Each instance demands rationale-based explanations rather than straightforward categorization. We further propose a Coarse-to-Fine Self-Ask (CFSA) annotation pipeline, which guides Vision-Language Models (VLLMs) through iterative question-answer rounds to yield high-quality labels at scale. Extensive evaluations on open-source and proprietary large language models under four experimental settings reveal consistent performance gaps, especially for more intricate scenarios, underscoring EI's potential to enrich empathetic, context-aware AI applications. Our benchmark and methods are publicly available at https://github.com/Lum1104/EIBench, offering a foundation for advanced multimodal causal analysis and next-generation affective computing.

Paper Structure

This paper contains 30 sections, 4 equations, 5 figures, and 15 tables.

Figures (5)

  • Figure 1: Illustrative examples of Emotion Interpretation in five categories: (a) Angry, (b) Sad, (c) Happy, (d) Excited, and (e) Complex. Each panel shows a scenario with potential triggers (e.g., service frustrations, medical news, festive attire, family interactions). In (e), multiple triggers or viewpoints co-occur: a child upset about craft-making and a caregiver’s frustration. By integrating facial cues, context, and domain knowledge, this approach surpasses mere emotion labeling, clarifying why individuals feel a certain way.
  • Figure 2: Distribution of emotional triggers across distinct categories, contrasting Basic Emotions (left) and Complex Emotions (right). Each slice represents the proportion of one trigger category.
  • Figure 3: Pipeline of the VLLM-assisted dataset construction.
  • Figure 4: Visualization of the numbers of emotional triggers across different categories (Basic Emotions).
  • Figure 5: Visualization of the numbers of emotional triggers in the Complex EI Subset.