Mitigating Modal Imbalance in Multimodal Reasoning

Chen Henry Wu, Neil Kale, Aditi Raghunathan

TL;DR

This work investigates how foundation models reason across modalities and why cross-modal conflicts degrade performance. By constructing cross-modal conflict datasets (CMQA) and cross-lingual variants (CLQA), the authors reveal a pronounced gap between unimodal and cross-modal reasoning, which they trace to cross-modal attention imbalance. They show that simply scaling up data does not fix this issue, but instance-level modality mixing (concatenating cross-modal instructions within the same training example) substantially reduces the imbalance and improves performance on vision-language benchmarks. The findings point to a practical, scalable path to more reliable multimodal reasoning and underscore the need for training paradigms that reflect real-world cross-modal contexts.

Abstract

Foundation models (FMs) deployed in real-world tasks such as computer-use agents must integrate diverse modalities. How good are FMs at joint reasoning, that is, reasoning over multiple modalities simultaneously, especially when the modalities interact and relate to each other to form cross-modal context? To better understand this problem, we study FMs on cross-modal conflicts: scenarios where conflicting evidence is presented across modalities. This allows us to examine whether FMs prioritize one modality over another or reason jointly to reconcile the conflict. Our experiments reveal that FMs can recognize conflicts in unimodal contexts, composed of a single modality, 90% of the time, but the ratio falls as low as 3% when evidence is split across modalities. Similar observations hold in cross-lingual contexts, composed of multiple languages. We trace this failure to cross-modal attention imbalance, showing that FMs exhibit extreme asymmetry in attention scores, disproportionately prioritizing certain modalities. We show that cross-modal attention imbalance does not go away by blindly scaling up multimodal or multilingual datasets, since these datasets lack training examples that explicitly require cross-modal reasoning. We demonstrate that even a simple and scalable method of explicitly combining multiple modalities within each training instance significantly reduces attention imbalance. Reduced attention imbalance directly translates to improved downstream performance on several vision-language benchmarks. Our findings underscore the importance of systematically addressing cross-modal contexts to build reliable foundation models.
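
The idea of combining multiple modalities within each training instance can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Example` record, the `mix_instances` helper, the placeholder strings, and the output dictionary layout are all hypothetical. It pairs one example from each of two unimodal pools and packs both into a single training instance, so that one context window contains more than one modality.

```python
import random
from dataclasses import dataclass


@dataclass
class Example:
    """Hypothetical unimodal instruction example (not the paper's format)."""
    modality: str     # e.g. "text" or "image"
    context: str      # placeholder for a passage or an image reference
    instruction: str
    response: str


def mix_instances(pool_a, pool_b, seed=0):
    """Pair examples from two pools and pack each pair into one instance."""
    rng = random.Random(seed)
    shuffled_b = list(pool_b)
    rng.shuffle(shuffled_b)  # random pairing across the two pools
    mixed = []
    for a, b in zip(pool_a, shuffled_b):
        pair = [a, b]
        rng.shuffle(pair)  # vary which modality comes first
        mixed.append({
            "contexts": [(ex.modality, ex.context) for ex in pair],
            "turns": [(ex.instruction, ex.response) for ex in pair],
        })
    return mixed


# Placeholder data, invented purely for illustration.
text_pool = [
    Example("text", "passage_0", "Summarize the passage.", "summary_0"),
    Example("text", "passage_1", "Summarize the passage.", "summary_1"),
]
image_pool = [
    Example("image", "image_0", "Describe the image.", "caption_0"),
    Example("image", "image_1", "Describe the image.", "caption_1"),
]

mixed = mix_instances(text_pool, image_pool)
for instance in mixed:
    print([modality for modality, _ in instance["contexts"]])
```

Each resulting instance carries one text context and one image context, in contrast to a dataset where every instance is unimodal even if the dataset as a whole covers several modalities.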

Paper Structure

This paper contains 31 sections, 2 equations, 18 figures.

Figures (18)

  • Figure 1: FM-based agents need to reason over diverse modalities, such as multilingual news, online shopping websites, maps, and EHR records. Failure to handle cross-modal context can result in consequences including misinformation (orange), purchasing a scam (yellow), misdirection (blue), or even providing the wrong medical treatment (light blue).
  • Figure 2: An illustration of cross-modal attention imbalance. In unimodal contexts (A), different domains show balanced normalized attention ($\mathrm{softmax}(QK^\top)$) despite divergent pre-softmax logits ($QK^\top$). Cross-modal contexts (B) expose cross-modal attention imbalance: normalization fails to mitigate the logit-level imbalance. Instance-level modality mixing (C) resolves this by training models to intrinsically balance attention logits across modalities.
  • Figure 3: FMs are worse at reasoning over cross-modal contexts than unimodal contexts.
  • Figure 4: Ablation studies on the prompt. FMs are worse at reasoning over cross-modal contexts than unimodal contexts. See the text for details of each prompt.
  • Figure 5: Cross-modal attention imbalance. English has larger attention than Chinese and images.
  • ...and 13 more figures
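
The mechanism described in the Figure 2 caption can be illustrated with a small numeric sketch. The logit values below are invented for illustration and are not measurements from the paper; the helper names `softmax` and `attention_mass_by_group` are likewise assumptions. The sketch shows that a constant offset in pre-softmax logits is invisible when each modality is normalized on its own, but dominates once both modalities share one softmax.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_mass_by_group(logits, groups):
    """Sum the softmax weights that fall on each group label."""
    mass = {}
    for weight, group in zip(softmax(logits), groups):
        mass[group] = mass.get(group, 0.0) + weight
    return mass


# Hypothetical pre-softmax logits: text keys score higher than image keys.
text_logits = [8.0, 8.5, 7.5]
image_logits = [2.0, 2.5, 1.5]

# Unimodal contexts: each modality is normalized alone, so the scale
# difference between the two sets of logits has no effect.
text_only = attention_mass_by_group(text_logits, ["text"] * 3)
image_only = attention_mass_by_group(image_logits, ["image"] * 3)

# Cross-modal context: a single softmax runs over the concatenated keys,
# so the higher-logit modality absorbs almost all of the attention.
mixed = attention_mass_by_group(
    text_logits + image_logits,
    ["text"] * 3 + ["image"] * 3,
)

print(text_only, image_only)
print(mixed)
```

With these made-up numbers the image tokens receive well under 1% of the attention mass in the mixed context, even though each modality looks perfectly normal in isolation.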