Emotion-Coherent Reasoning for Multimodal LLMs via Emotional Rationale Verifier

Hyeongseop Rha, Jeong Hun Yeo, Yeonju Kim, Yong Man Ro

TL;DR

This work tackles the misalignment between predicted emotions and their natural-language explanations in multimodal LLMs. It introduces the Emotional Rationale Verifier (ERV), a distilled, lightweight judge of explanation-emotion consistency, and an Explanation Reward that guides reasoning without changing model architecture or requiring extra paired video-description data. Through a three-stage pipeline (SFT, ERV training, and GRPO RL with coherence rewards), the approach improves Explanation Emotion Accuracy, Faithful Consistency Rate, and Explanation–Prediction Consistency on MER benchmarks (MAFW, DFEW) while maintaining emotion recognition performance. Human evaluations corroborate increased coherence and emotional grounding of explanations, supporting more trustworthy and interpretable multimodal interactions. The proposed method advances emotionally intelligent HCI by ensuring explanations faithfully reflect underlying affective states in video–audio inputs, with robust metric validation and cross-evaluator consistency.

Abstract

The recent advancement of Multimodal Large Language Models (MLLMs) is transforming human-computer interaction (HCI) from surface-level exchanges into more nuanced and emotionally intelligent communication. To realize this shift, emotion understanding becomes essential, allowing systems to capture subtle cues underlying user intent. Furthermore, providing faithful explanations for predicted emotions is crucial to ensure interpretability and build user trust. However, current MLLM-based methods often generate emotion explanations that diverge from the target labels and sometimes even contradict their own predicted emotions. This inconsistency creates a critical risk of misunderstanding and erodes reliability in interactive settings. To address this, we propose a novel approach: the Emotional Rationale Verifier (ERV) and an Explanation Reward. Our method guides the model to produce reasoning that is explicitly consistent with the target emotion during multimodal emotion recognition, without modifying the model architecture or requiring additional paired video-description annotations. Our method significantly improves faithful explanation-prediction consistency and explanation emotion accuracy on the MAFW and DFEW datasets. Through extensive experiments and human evaluations, we show that our approach not only enhances alignment between explanation and prediction but also empowers MLLMs to deliver emotionally coherent, trustworthy interactions, marking a key step toward truly human-like HCI systems.

Paper Structure

This paper contains 33 sections, 6 equations, 11 figures, 8 tables, 1 algorithm.

Figures (11)

  • Figure 1: Illustrative comparison between the baseline and our model on explanation–emotion alignment (text truncated for brevity). While both correctly predict the final emotion, the baseline fails to generate an emotionally coherent explanation, highlighting its misalignment.
  • Figure 2: (a) The policy model $\pi_{\theta}$ generates $G$ responses $o_1, \ldots, o_G$, and the Emotional Rationale Verifier (ERV) assigns an explanation reward $R_E$ to each response. (b) For each response $o_i$, its explanation $E_i$ is extracted and evaluated to produce its reward $R_{i,E}$. The ground-truth emotion used for evaluation in this scenario is 'Happy' ($e_{gt} =$ 'Happy'). Collectively, $R_E = \{ R_{1,E}, R_{2,E}, \ldots, R_{G,E} \}$ denotes the set of explanation rewards for the $G$ responses.
  • Figure 3: Prompt used to evaluate the emotion conveyed in the generated explanation. {Emotion List} is a permutation of the ground-truth (GT) emotion labels of the evaluation dataset. {Explanation} is $E_i$ from output $o_i$.
  • Figure 4: Two questions used to evaluate the emotion conveyed in the generated explanation. {Emotion Explanation} is an output generated by either R1-Omni or our model. (Input Video) indicates that the video accompanies the question.
  • Figure 5: The actual survey platform provided to participants in the human study.
  • ...and 6 more figures
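
The reward flow described in the Figure 2 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several details are assumptions made only for illustration: the explanation is assumed to be wrapped in `<think>...</think>` tags, the reward is assumed to be binary (1.0 when the verifier's label for $E_i$ equals $e_{gt}$, otherwise 0.0), and `toy_verifier` is a hypothetical keyword-based stand-in for the trained ERV. The helper names `extract_explanation` and `explanation_rewards` are likewise hypothetical.

```python
import re
from typing import Callable, List


def extract_explanation(response: str) -> str:
    """Extract the explanation E_i from a response o_i.

    Assumption: the explanation sits inside <think>...</think> tags.
    Returns an empty string when no such span is found.
    """
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    return match.group(1).strip() if match else ""


def explanation_rewards(
    responses: List[str],
    e_gt: str,
    verifier: Callable[[str], str],
) -> List[float]:
    """Compute R_E = {R_1E, ..., R_GE} for G sampled responses.

    Assumption: a binary reward, 1.0 if the verifier's label for the
    explanation matches the ground-truth emotion e_gt, else 0.0.
    """
    rewards = []
    for o_i in responses:
        E_i = extract_explanation(o_i)
        predicted = verifier(E_i) if E_i else None
        rewards.append(1.0 if predicted == e_gt else 0.0)
    return rewards


def toy_verifier(explanation: str) -> str:
    """Hypothetical stand-in for the ERV: a keyword lookup, not a model."""
    text = explanation.lower()
    if "smil" in text or "laugh" in text:
        return "Happy"
    if "frown" in text or "tears" in text:
        return "Sad"
    return "Neutral"


responses = [
    "<think>The person is smiling and laughing.</think><answer>Happy</answer>",
    "<think>The person frowns with tears.</think><answer>Happy</answer>",
    "<answer>Happy</answer>",  # no explanation present
]
R_E = explanation_rewards(responses, e_gt="Happy", verifier=toy_verifier)
print(R_E)  # [1.0, 0.0, 0.0]
```

The second response shows the failure mode the paper targets: the final answer is correct, but the explanation conveys a different emotion, so it earns no explanation reward in this sketch.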