XEmoGPT: An Explainable Multimodal Emotion Recognition Framework with Cue-Level Perception and Reasoning

Hanwen Zhang, Yao Liu, Peiyuan Jiang, Lang Junjie, Xie Jun, Yihui He, Yajiao Deng, Siyu Du, Qiao Liu

TL;DR

XEmoGPT tackles cue-level Explainable Multimodal Emotion Recognition (EMER) by adding a Video Emotional Cue Bridge (VECB) and an Audio Emotional Cue Bridge (AECB) to a standard multimodal LLM pipeline, so that the video and audio encoders perceive fine-grained emotional cues. It introduces the EmoCue dataset for cue-level supervision, the EmoCue-360 metric for automated cue-level evaluation, and EmoCue-Eval, a benchmark of 400 expert-annotated samples. Experiments show strong performance in both emotional cue perception and reasoning, with robustness to prompts and styles and clear gains from modality-specific bridging. The resulting cue-grounded explanations target applications such as human-computer interaction and social media analytics.

Abstract

Explainable Multimodal Emotion Recognition (EMER) plays a crucial role in applications such as human-computer interaction and social media analytics. However, current approaches struggle with cue-level perception and reasoning due to two main challenges: 1) general-purpose modality encoders are pretrained to capture global structures and general semantics rather than fine-grained emotional cues, resulting in limited sensitivity to emotional signals; and 2) available datasets usually involve a trade-off between annotation quality and scale, which leads to insufficient supervision for emotional cues and ultimately limits cue-level reasoning. Moreover, existing evaluation metrics are inadequate for assessing cue-level reasoning performance. To address these challenges, we propose eXplainable Emotion GPT (XEmoGPT), a novel EMER framework capable of both perceiving and reasoning over emotional cues. It incorporates two specialized modules: the Video Emotional Cue Bridge (VECB) and the Audio Emotional Cue Bridge (AECB), which enhance the video and audio encoders through carefully designed tasks for fine-grained emotional cue perception. To further support cue-level reasoning, we construct a large-scale dataset, EmoCue, designed to teach XEmoGPT how to reason over multimodal emotional cues. In addition, we introduce EmoCue-360, an automated metric that extracts and matches emotional cues using semantic similarity, and release EmoCue-Eval, a benchmark of 400 expert-annotated samples covering diverse emotional scenarios. Experimental results show that XEmoGPT achieves strong performance in both emotional cue perception and reasoning.
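
To make the idea of matching emotional cues by semantic similarity concrete, the following is a minimal sketch and not the EmoCue-360 formulas, which are not given in this summary. It assumes a toy bag-of-words embedding (`embed`) in place of a real sentence encoder, a greedy one-to-one matching rule, a hypothetical similarity threshold, and precision/recall/F1 as the reported scores. All function names and the example cues are illustrative.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; an assumed stand-in for a real sentence encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def match_cues(predicted, reference, threshold=0.6):
    """Greedily pair each predicted cue with at most one reference cue."""
    pred_vecs = [embed(c) for c in predicted]
    ref_vecs = [embed(c) for c in reference]
    # Consider candidate pairs from most to least similar.
    pairs = sorted(
        (
            (cosine(p, r), i, j)
            for i, p in enumerate(pred_vecs)
            for j, r in enumerate(ref_vecs)
        ),
        reverse=True,
    )
    used_pred, used_ref, matches = set(), set(), []
    for sim, i, j in pairs:
        if sim < threshold:
            break  # every remaining pair is less similar
        if i in used_pred or j in used_ref:
            continue
        used_pred.add(i)
        used_ref.add(j)
        matches.append((i, j, sim))
    return matches


def cue_scores(predicted, reference, threshold=0.6):
    """Precision, recall, and F1 over matched cues (assumed scoring rule)."""
    matches = match_cues(predicted, reference, threshold)
    precision = len(matches) / len(predicted) if predicted else 0.0
    recall = len(matches) / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    predicted = ["furrowed brows", "raised voice", "broad smile"]
    reference = ["furrowed brows", "trembling voice"]
    print(cue_scores(predicted, reference))
```

In this sketch a stricter threshold rewards only near-identical cue descriptions, while a looser one also credits partial overlaps such as "raised voice" against "trembling voice"; a real encoder would replace `embed` without changing the matching logic.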

Paper Structure (35 sections, 11 equations, 10 figures, 5 tables)

This paper contains 35 sections, 11 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: (a) Comparison between Multimodal Emotion Recognition models and Explainable Multimodal Emotion Recognition models. (b) Comparison between XEmoGPT and other Emotional MLLMs: Green/red text indicates emotional predictions with/without explicit cue-level explanations.
  • Figure 2: Architecture of XEmoGPT: It integrates visual, auditory, and textual information to generate a description containing both visual and auditory emotional cues. The VECB and AECB modules primarily serve to enhance the emotional cue perception capabilities of the modality encoders.
  • Figure 3: Training process of VECB and AECB: The VECB module is trained with three auxiliary tasks: Contrastive Video Emotional Cue Alignment, Frame Temporal Discrimination, and Masked Frame Modeling. The AECB module is trained with the Contrastive Audio Emotional Cue Alignment task.
  • Figure 4: Computation pipeline of the EmoCue-360 metric: Information Extraction, Cue Vectorization, and Metric Computation. The mathematical formulas used correspond to those in the paper's evaluation section.
  • Figure 5: Comparison of quality between EmoCue-Eval and EMER datasets, showing annotation length (column 1) and the distribution of emotional cue counts (columns 2–4).
  • ...and 5 more figures
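
The Figure 3 caption names contrastive alignment tasks for both VECB and AECB. As a rough illustration of what contrastive alignment between clip embeddings and cue-text embeddings can look like, here is a minimal sketch of a symmetric InfoNCE loss in plain Python. This is the generic CLIP-style objective, used here as an assumption; the paper's exact loss, temperature, and embedding sources are not stated in this summary, and the names `info_nce`, `media_embs`, and `cue_embs` are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def log_softmax_at(logits, index):
    """Numerically stable log-softmax value for one position."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return logits[index] - log_sum


def info_nce(media_embs, cue_embs, temperature=0.07):
    """Symmetric InfoNCE: the i-th media embedding should match the i-th cue embedding."""
    media = [l2_normalize(v) for v in media_embs]
    cues = [l2_normalize(v) for v in cue_embs]
    n = len(media)
    # Temperature-scaled cosine similarities between every media/cue pair.
    sims = [
        [sum(a * b for a, b in zip(m, c)) / temperature for c in cues]
        for m in media
    ]
    # Media-to-cue direction: each row should peak on its diagonal entry.
    media_to_cue = -sum(log_softmax_at(sims[i], i) for i in range(n)) / n
    # Cue-to-media direction: each column should peak on its diagonal entry.
    cue_to_media = -sum(
        log_softmax_at([sims[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (media_to_cue + cue_to_media)


if __name__ == "__main__":
    clips = [[1.0, 0.0], [0.0, 1.0]]
    aligned_cues = [[1.0, 0.0], [0.0, 1.0]]
    shuffled_cues = [[0.0, 1.0], [1.0, 0.0]]
    print(info_nce(clips, aligned_cues))   # near zero: pairs agree
    print(info_nce(clips, shuffled_cues))  # large: pairs disagree
```

The loss is low when each clip is most similar to its own cue description and high otherwise, which is the behavior a cue-alignment objective of this kind is meant to encourage in the encoder.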