XEmoGPT: An Explainable Multimodal Emotion Recognition Framework with Cue-Level Perception and Reasoning

Hanwen Zhang; Yao Liu; Peiyuan Jiang; Lang Junjie; Xie Jun; Yihui He; Yajiao Deng; Siyu Du; Qiao Liu

TL;DR

XEmoGPT tackles cue-level Explainable Multimodal Emotion Recognition (EMER) by adding Video and Audio Emotional Cue Bridges to a standard multimodal LLM pipeline, enabling fine-grained cue perception and reasoning. It introduces the EmoCue dataset for cue-level supervision and the EmoCue-360 metric for cue-level evaluation, plus EmoCue-Eval, a benchmark of 400 expert-annotated samples. Empirical results show state-of-the-art performance in cue perception and reasoning, robustness to variations in prompts and styles, and clear gains from modality-specific bridging. The resulting cue-grounded explanations are verifiable, which supports deployment in human-computer interaction and social analytics.

Abstract

Explainable Multimodal Emotion Recognition (EMER) plays a crucial role in applications such as human-computer interaction and social media analytics. However, current approaches struggle with cue-level perception and reasoning because of two main challenges: 1) general-purpose modality encoders are pretrained to capture global structure and general semantics rather than fine-grained emotional cues, which limits their sensitivity to emotional signals; and 2) available datasets usually trade annotation quality against scale, which leaves emotional cues under-supervised and ultimately limits cue-level reasoning. Moreover, existing evaluation metrics are inadequate for assessing cue-level reasoning performance. To address these challenges, we propose eXplainable Emotion GPT (XEmoGPT), a novel EMER framework capable of both perceiving and reasoning over emotional cues. It incorporates two specialized modules, the Video Emotional Cue Bridge (VECB) and the Audio Emotional Cue Bridge (AECB), which enhance the video and audio encoders through carefully designed tasks for fine-grained emotional cue perception. To further support cue-level reasoning, we construct a large-scale dataset, EmoCue, designed to teach XEmoGPT how to reason over multimodal emotional cues. In addition, we introduce EmoCue-360, an automated metric that extracts and matches emotional cues using semantic similarity, and release EmoCue-Eval, a benchmark of 400 expert-annotated samples covering diverse emotional scenarios. Experimental results show that XEmoGPT achieves strong performance in both emotional cue perception and reasoning.
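
To make the idea of matching emotional cues by semantic similarity more concrete, the following is a minimal sketch and not the EmoCue-360 definition, whose formulas are not given in this summary. It assumes cues have already been extracted as short phrases, uses a toy bag-of-words vector as a hypothetical stand-in for a real sentence encoder, and pairs predicted and reference cues greedily above a similarity threshold. The names `embed`, `match_cues`, `cue_scores`, and the threshold value are all illustrative assumptions.

```python
import math
from collections import Counter


def embed(cue: str) -> Counter:
    """Toy bag-of-words vector; a hypothetical stand-in for a sentence encoder."""
    return Counter(cue.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_cues(predicted, reference, threshold=0.5):
    """Greedy one-to-one matching: most similar pairs are accepted first."""
    pred_vecs = [embed(c) for c in predicted]
    ref_vecs = [embed(c) for c in reference]
    pairs = sorted(
        (
            (cosine(p, r), i, j)
            for i, p in enumerate(pred_vecs)
            for j, r in enumerate(ref_vecs)
        ),
        reverse=True,
    )
    used_pred, used_ref, matches = set(), set(), []
    for sim, i, j in pairs:
        if sim < threshold:
            break  # remaining pairs are even less similar
        if i in used_pred or j in used_ref:
            continue  # each cue may be matched at most once
        used_pred.add(i)
        used_ref.add(j)
        matches.append((predicted[i], reference[j], sim))
    return matches


def cue_scores(predicted, reference, threshold=0.5):
    """Cue-level precision, recall, and F1 from the matched pairs (assumed scoring)."""
    matched = len(match_cues(predicted, reference, threshold))
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Calling `cue_scores(model_cues, annotated_cues)` on two lists of cue phrases returns a `(precision, recall, f1)` tuple. In practice the toy `embed` function would be replaced by a proper sentence-embedding model, since word overlap cannot recognize paraphrased cues.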

Paper Structure (35 sections, 11 equations, 10 figures, 5 tables)

Introduction
Related Work
Multimodal Large Language Models
Explainable Multimodal Emotion Recognition
Methodology
Model Architecture
Video Emotional Cue Bridge
Audio Emotional Cue Bridge
Video Emotional Cue Encoding
Contrastive Video Emotional Cue Alignment
Frame Temporal Discrimination
Masked Frame Modeling
Audio Emotional Cue Encoding
Contrastive Audio Emotion-Cue Alignment
Training Process
...and 20 more sections

Figures (10)

Figure 1: (a) Comparison between Multimodal Emotion Recognition models and Explainable Multimodal Emotion Recognition models. (b) Comparison between XEmoGPT and other Emotional MLLMs: Green/red text indicates emotional predictions with/without explicit cue-level explanations.
Figure 2: Architecture of XEmoGPT: It integrates visual, auditory, and textual information to generate a description containing both visual and auditory emotional cues. The VECB and AECB modules primarily serve to enhance the emotional cue perception capabilities of the modality encoders.
Figure 3: Training process of VECB and AECB: The VECB module is trained with three auxiliary tasks: Contrastive Video Emotional Cue Alignment, Frame Temporal Discrimination, and Masked Frame Modeling. The AECB module is trained with the Contrastive Audio Emotional Cue Alignment task.
Figure 4: Computation pipeline of the EmoCue-360 metric: Information Extraction, Cue Vectorization, and Metric Computation. The mathematical formulas used correspond to those in the paper's evaluation section.
Figure 5: Comparison of quality between EmoCue-Eval and EMER datasets, showing annotation length (column 1) and the distribution of emotional cue counts (columns 2–4).
...and 5 more figures
