Follow the Clues, Frame the Truth: Hybrid-evidential Deductive Reasoning in Open-Vocabulary Multimodal Emotion Recognition

Yu Liu, Lei Zhang, Haoxun Li, Hanlei Shi, Yuxuan Ding, Leyuan Qu, Taihao Li

Abstract

Open-Vocabulary Multimodal Emotion Recognition (OV-MER) is inherently challenging due to the ambiguity of equivocal multimodal cues, which often stem from distinct unobserved situational dynamics. While Multimodal Large Language Models (MLLMs) offer extensive semantic coverage, their performance is often bottlenecked by premature commitment to dominant data priors, resulting in suboptimal heuristics that overlook crucial, complementary affective cues across modalities. We argue that effective affective reasoning requires more than surface-level association; it necessitates reconstructing nuanced emotional states by synthesizing multiple evidence-grounded rationales that reconcile these observations from diverse latent perspectives. We introduce HyDRA, a Hybrid-evidential Deductive Reasoning Architecture that formalizes inference as a Propose-Verify-Decide protocol. To internalize this abductive process, we employ reinforcement learning with hierarchical reward shaping, aligning the reasoning trajectories with final task performance to ensure they best reconcile the observed multimodal cues. Systematic evaluations validate our design choices, with HyDRA consistently outperforming strong baselines, especially in ambiguous or conflicting scenarios, while providing interpretable, diagnostic evidence traces.

Paper Structure (19 sections, 9 equations, 2 figures, 5 tables)

Figures (2)

  • Figure 1: HyDRA overview. (A) Propose–Verify–Decide inference protocol: given multimodal input, HyDRA proposes multiple latent-context hypotheses, adjudicates them via evidence-constrained comparison with explicit citations, and outputs the most plausible emotion set. (B) Learning HyDRA with GRPO and hierarchical reward shaping: for each input, we sample a group of structured trajectories in the <hyp>–<think>–<ans> format, score them with six rewards, compute group-relative advantages, and update the policy to favor evidential closure and robust decisions under ambiguity. Textual rationales are grounded in human-verified textualized multimodal cues (ObsG) provided by the datasets, with grounding enforced via semantic alignment to ground-truth cue descriptions.
  • Figure 2: Left: the original input (visual/audio/text), in which visual priors contradict subtle audio/textual cues. Middle (Success): HyDRA resolves the conflict via the Propose–Verify–Decide protocol. Right (Failure): R1-omni commits prematurely to salient visual signals. Because the original videos show real individuals, personally identifiable information has been removed and the frames have been visually processed to address copyright and privacy concerns.
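
The training loop summarized in the Figure 1 caption (sample a group of structured trajectories, score them, compute group-relative advantages) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The closing tags (`</hyp>`, `</think>`, `</ans>`), the function names, and the use of a single scalar reward per trajectory are assumptions made for illustration; the caption only says that six rewards are used and does not specify them here. The advantage formula is the standard group-normalized form commonly associated with GRPO.

```python
import re
from statistics import mean, pstdev

# Tag order taken from the caption; the closing-tag syntax is an assumption.
TAGS = ("hyp", "think", "ans")


def parse_trajectory(text):
    """Return {tag: content} if text holds <hyp>, <think>, <ans> blocks
    in that order (and nothing else), otherwise None."""
    pattern = r"\s*".join(rf"<{t}>(.*?)</{t}>" for t in TAGS)
    match = re.fullmatch(r"\s*" + pattern + r"\s*", text, flags=re.DOTALL)
    if match is None:
        return None
    return {tag: match.group(i + 1).strip() for i, tag in enumerate(TAGS)}


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group:
    A_i = (r_i - mean(r)) / (std(r) + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical group of sampled trajectories for one input.
    group = [
        "<hyp>H1; H2</hyp><think>compare cues</think><ans>anxious</ans>",
        "<ans>happy</ans>",  # malformed: missing <hyp> and <think>
    ]
    # Toy scalar reward: 1.0 for a well-formed trajectory, else 0.0.
    rewards = [1.0 if parse_trajectory(t) else 0.0 for t in group]
    print(group_relative_advantages(rewards))
```

Because advantages are centered within each group, a trajectory is rewarded only for being better than its siblings sampled from the same input, which is why a group in which every trajectory scores the same yields zero advantage throughout.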