Compose and Fuse: Revisiting the Foundational Bottlenecks in Multimodal Reasoning

Yucheng Wang, Yifan Hou, Aydin Javadov, Mubashara Akhtar, Mrinmaya Sachan

TL;DR

Integration, not perception, is the main barrier to multimodal reasoning, which points to composition-aware training and early fusion control as promising directions.

Abstract

Multimodal large language models (MLLMs) promise enhanced reasoning by integrating diverse inputs such as text, vision, and audio. Yet cross-modal reasoning remains underexplored, with conflicting reports on whether added modalities help or harm performance. These inconsistencies stem from a lack of controlled evaluation frameworks and of analyses of models' internals that isolate when and why modality interactions support or undermine reasoning. We address this gap through a logic-grounded evaluation framework that categorizes multimodal reasoning into six interaction patterns, varying how facts are distributed across modalities and logically combined. Empirically, additional modalities enhance reasoning only when they provide independent and sufficient reasoning paths, while redundant or chained entailment support often hurts performance. Moreover, reasoning degrades in three systematic ways: weaker modalities drag down overall performance, conflicts bias preference toward certain modalities, and joint signals from different modalities fail to be integrated effectively. We therefore identify two core failures: a task-composition bottleneck, where recognition and reasoning cannot be jointly executed in one pass, and a fusion bottleneck, where early integration introduces bias. Investigating further, we find that attention patterns fail to encode fact usefulness, but simple two-step prompting (recognize, then reason) restores performance, confirming the task-composition bottleneck. Moreover, modality identity remains recoverable in early layers, and softening attention in early fusion improves reasoning, highlighting biased fusion as another failure mode. Overall, our findings show that integration, not perception, is the main barrier to multimodal reasoning, suggesting composition-aware training and early fusion control as promising directions.
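
The abstract mentions "softening attention" without stating the mechanism. One common way to soften an attention distribution is to divide the pre-softmax scores by a temperature greater than 1; the sketch below illustrates only that generic idea. The function names, the toy scores, and the specific temperature values are assumptions for illustration, not the authors' implementation.

```python
import math


def softmax_with_temperature(scores, temperature=1.0):
    """Softmax over raw attention scores divided by a temperature.

    temperature > 1 flattens (softens) the distribution,
    temperature < 1 sharpens it. Hypothetical helper, not from the paper.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Toy pre-softmax scores for one query over four keys (made-up values).
scores = [4.0, 1.0, 0.5, 0.2]

sharp = softmax_with_temperature(scores, temperature=0.4)
base = softmax_with_temperature(scores, temperature=1.0)
soft = softmax_with_temperature(scores, temperature=1.8)

for name, dist in [("sharp", sharp), ("base", base), ("soft", soft)]:
    print(name, [round(p, 3) for p in dist], "entropy:", round(entropy(dist), 3))
```

In this toy setting, raising the temperature spreads probability mass away from the dominant key, which is the sense in which attention is "softened" here. How such a change would be applied inside a real model (which heads, which layers) is not specified in the abstract.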

Paper Structure

This paper contains 79 sections, 8 equations, 12 figures, 9 tables.

Figures (12)

  • Figure 1: Multimodal logical reasoning setup and evaluation pipeline. (a) Logical reasoning example: a single-step deduction where the fact "Bob is curious" and the rule "Curious people are purple" entail the conclusion "Bob is purple." (b) Modality renderings: the same fact is rendered as text ("Bob is curious"), as audio via neural TTS, and as a schematic visual using graph visualization. (c) Evaluation prompt pattern: the model receives modality-specific fact blocks (text, audio, vision), followed by the rule set and the question with multiple-choice options; the model outputs the predicted answer.
  • Figure 2: Attention probing and reasoning performance. (a) Modality probing for information usefulness shows moderate accuracy, suggesting models cannot clearly distinguish useful from distractor facts. (b) Although models excel in fact recognition and text-only reasoning, their performance drops significantly on multimodal reasoning, indicating that the key limitation lies in composing recognition and reasoning across modalities.
  • Figure 3: Modality probing based on attention patterns. (a) All models achieve perfect probe accuracy in predicting the modality from attention patterns. (b) For Qwen, linear probe weights show that modality information is primarily captured in the first four layers. (c) Attention manipulation applied to different sets of 4 layers (adjusting head temperature from 0.4 to 1.8); performance improves significantly when the early 4 layers are manipulated.
  • Figure 4: Prompt template and model (Qwen) output (Equivalence).
  • Figure 5: Prompt template and model (Qwen) output (Alternative).
  • ...and 7 more figures
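
The single-step deduction in Figure 1(a) can be sketched as a tiny forward-chaining routine. This is a minimal illustration only: the `Fact` class, the `deduce` function, the rule encoding as (premise, conclusion) pairs, and the choice of "audio" as the fact's modality are all hypothetical, and the paper's actual data format is not shown on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    entity: str
    attribute: str
    modality: str  # "text", "audio", or "vision"; a label only in this sketch


def deduce(facts, rules):
    """Forward-chain rules of the form "<premise> people are <conclusion>".

    Returns the set of (entity, attribute) pairs that hold once no rule
    can derive anything new. Hypothetical helper, not the authors' code.
    """
    known = {(f.entity, f.attribute) for f in facts}
    changed = True
    while changed:
        changed = False
        for entity, attribute in list(known):
            for premise, conclusion in rules:
                if attribute == premise and (entity, conclusion) not in known:
                    known.add((entity, conclusion))
                    changed = True
    return known


# Fact "Bob is curious", here assumed to arrive through the audio channel.
facts = [Fact("Bob", "curious", "audio")]
# Rule "Curious people are purple".
rules = [("curious", "purple")]

derived = deduce(facts, rules)
print(("Bob", "purple") in derived)
```

The modality tag plays no role in the logic itself; it only records which channel delivered the fact, which mirrors the setup where the same fact can be rendered as text, audio, or an image.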