Learning What to Attend First: Modality-Importance-Guided Reasoning for Reliable Multimodal Emotion Understanding

Hyeongseop Rha, Jeong Hun Yeo, Junil Won, Se Jin Park, Yong Man Ro

TL;DR

This work addresses reliability gaps in reasoning-based multimodal emotion understanding by introducing Modality-Importance-Guided Reasoning (MIGR), which identifies the emotion-dominant modality and reorganizes reasoning to begin with that modality. It combines modality-aligned supervised fine-tuning and modality-aware reward optimization to enforce MI-guided, emotion-grounded explanations. MIGR achieves substantially higher explanation–prediction consistency and reduces emotionally inconsistent reasoning on the DFEW/MAFW benchmarks, though it encounters challenges with the Surprise category due to inherent annotation ambiguity. Overall, MIGR advances trustworthy, interpretable multimodal emotion reasoning by aligning data organization and optimization with modality dominance.

Abstract

In this paper, we present Modality-Importance-Guided Reasoning (MIGR), a framework designed to improve the reliability of reasoning-based multimodal emotion understanding in multimodal large language models. Although existing methods have advanced emotion understanding, they often suffer from reasoning drift: models gradually rely on their own generated text instead of multimodal evidence, and their explanations are overly shaped by visually initiated reasoning paths. To address these issues, we introduce Modality Importance (MI), a simple yet effective mechanism for identifying the emotion-dominant modality. Using MI, MIGR reorganizes reasoning sequences so that explanations begin from the modality most critical to the target emotion, preventing early reasoning from being misled by less informative cues. Our two-stage framework, comprising modality-aligned supervised fine-tuning and modality-aware reward optimization, encourages models to generate emotionally grounded, causally relevant, and coherence-preserving explanations. Experimental results on the DFEW benchmark show that MIGR substantially improves reasoning reliability, decreasing instances of correct predictions accompanied by emotionally inconsistent explanations from 18.10% to 7.37%. These results confirm the benefit of initiating reasoning from the emotion-dominant modality.
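
The reordering idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the per-modality MI scores, and the tag format are all hypothetical, and the sketch assumes MI is available as one scalar per modality, with the highest score marking the emotion-dominant modality.

```python
from typing import Dict


def dominant_modality(mi_scores: Dict[str, float]) -> str:
    """Return the modality with the highest (hypothetical) MI score."""
    return max(mi_scores, key=mi_scores.get)


def reorder_reasoning(segments: Dict[str, str], mi_scores: Dict[str, float]) -> str:
    """Put the dominant modality's description first.

    The remaining modalities keep their original relative order.
    The <modality>...</modality> tag format is an assumption for illustration.
    """
    first = dominant_modality(mi_scores)
    ordered = [first] + [m for m in segments if m != first]
    return "\n".join(f"<{m}>{segments[m]}</{m}>" for m in ordered)


# Example: an audio-dominant sample, loosely modeled on the "sobbing" case.
segments = {"visual": "The person is frowning.", "audio": "A sobbing voice is heard."}
mi_scores = {"visual": 0.3, "audio": 0.7}  # made-up scores
reordered = reorder_reasoning(segments, mi_scores)
```

Here the audio description is moved ahead of the visual one, so that any reasoning generated afterwards is conditioned on the more informative cue first.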

Paper Structure

This paper contains 36 sections, 5 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Failure pattern of the baseline multimodal reasoning model. (a) The model’s attention gradually drifts from external multimodal inputs to previously generated text. (b) Because training fixes the order as visual description → audio description → reasoning, the model often begins with a visual summary; when this initial step is inaccurate or emotionally irrelevant, subsequent reasoning becomes increasingly misaligned.
  • Figure 2: Overview of the proposed data construction pipeline, including FAU-based emotion-consistent data augmentation, MI estimation, and MI-guided modality-specific reasoning reordering.
  • Figure 3: Illustration of the three rewards in MIGR: the Modality-Aligned Order Reward enforces MI-consistent reasoning order; the Modality-Grounded Reasoning Reward ensures emotion-consistent modality-specific reasoning; and the answer reward guarantees correct final emotion prediction.
  • Figure 4: Qualitative comparisons of emotion reasoning. (Left) For a speechless sample, baseline models fail to infer the emotion due to missing audio cues, whereas MIGR correctly focuses on the visual modality and integrates it within the final <think> step to produce a coherent conclusion. (Right) In an audio-dominant sample, MIGR first identifies the key audio cue (“sobbing”) and leverages it to interpret ambiguous visual information, leading to an accurate prediction. In contrast, other models misinterpret the visual cue (frowning) as anger and produce incorrect reasoning.
  • Figure 5: Comparison of modality-wise attention distributions over generated tokens among MIGR, ERV, and R1-Omni.
  • ...and 3 more figures