Emotion-Coherent Reasoning for Multimodal LLMs via Emotional Rationale Verifier
Hyeongseop Rha, Jeong Hun Yeo, Yeonju Kim, Yong Man Ro
TL;DR
This work tackles the misalignment between predicted emotions and their natural-language explanations in multimodal LLMs. It introduces the Emotional Rationale Verifier (ERV), a distilled, lightweight judge of explanation-emotion consistency, and an Explanation Reward that guides reasoning without changing the model architecture or requiring extra paired video-description data. Through a three-stage pipeline (SFT, ERV training, and GRPO RL with coherence rewards), the approach improves Explanation Emotion Accuracy, Faithful Consistency Rate, and Explanation–Prediction Consistency on the multimodal emotion recognition (MER) benchmarks MAFW and DFEW while maintaining emotion recognition performance. Human evaluations corroborate the increased coherence and emotional grounding of the explanations, supporting more trustworthy and interpretable multimodal interactions. The proposed method advances emotionally intelligent HCI by ensuring that explanations faithfully reflect the underlying affective states in video–audio inputs, with metric validation and cross-evaluator consistency checks.
Abstract
The recent advancement of Multimodal Large Language Models (MLLMs) is transforming human-computer interaction (HCI) from surface-level exchanges into more nuanced and emotionally intelligent communication. To realize this shift, emotion understanding becomes essential, allowing systems to capture the subtle cues underlying user intent. Furthermore, providing faithful explanations for predicted emotions is crucial to ensure interpretability and build user trust. However, current MLLM-based methods often generate emotion explanations that diverge from the target labels and sometimes even contradict their own predicted emotions. This inconsistency poses a critical risk of misunderstanding and erodes reliability in interactive settings. To address this, we propose a novel approach: the Emotional Rationale Verifier (ERV) and an Explanation Reward. Our method guides the model to produce reasoning that is explicitly consistent with the target emotion during multimodal emotion recognition, without modifying the model architecture or requiring additional paired video-description annotations. It significantly improves faithful explanation-prediction consistency and explanation emotion accuracy on the MAFW and DFEW datasets. Through extensive experiments and human evaluations, we show that our approach not only enhances alignment between explanation and prediction but also enables MLLMs to deliver emotionally coherent, trustworthy interactions, marking a key step toward truly human-like HCI systems.
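To make the idea concrete, the following is a minimal sketch of how a verifier-based explanation reward might be combined with a label-accuracy reward and converted into group-relative advantages in the style of GRPO. It is not the authors' implementation. The keyword-matching `toy_verifier` is a hypothetical stand-in for the ERV (which is a learned, distilled model), and the reward weights, function names, and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import Callable, Dict, List, Tuple

# Hypothetical keyword table; the real ERV is a trained model, not a lookup.
KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "happy": ("smile", "laugh", "cheerful"),
    "sad": ("tears", "frown", "sobbing"),
    "angry": ("shout", "glare", "clenched"),
}


def toy_verifier(explanation: str) -> str:
    """Stand-in for the ERV: guess the emotion an explanation conveys."""
    text = explanation.lower()
    scores = {emo: sum(text.count(k) for k in kws) for emo, kws in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    # Fall back to "neutral" when no keyword matches at all.
    return best if scores[best] > 0 else "neutral"


def total_reward(
    predicted: str,
    explanation: str,
    target: str,
    verifier: Callable[[str], str],
    w_acc: float = 1.0,  # assumed weight for the label-accuracy term
    w_exp: float = 1.0,  # assumed weight for the explanation term
) -> float:
    """Accuracy reward plus an explanation reward judged by the verifier."""
    acc = 1.0 if predicted == target else 0.0
    exp = 1.0 if verifier(explanation) == target else 0.0
    return w_acc * acc + w_exp * exp


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one sampled group (mean 0, unit scale)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three sampled responses (predicted label, explanation) for one clip
# whose target emotion is "happy".
samples = [
    ("happy", "She smiles and laughs with a cheerful tone."),
    ("happy", "Her voice trembles and tears run down her face."),
    ("sad", "He frowns, sobbing quietly."),
]
rewards = [total_reward(p, e, "happy", toy_verifier) for p, e in samples]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

In this toy group, the second response predicts the right label but explains it with sadness cues, so it earns less reward than the first response. That gap is the kind of signal a coherence reward is meant to provide.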
