When Modalities Conflict: How Unimodal Reasoning Uncertainty Governs Preference Dynamics in MLLMs

Zhuoran Zhang, Tengyue Wang, Xilin Gong, Yang Shi, Haotian Wang, Di Wang, Lijie Hu

TL;DR

This work reframes modality following in multimodal LLMs as a dynamic outcome governed by relative unimodal reasoning uncertainty and an intrinsic modality preference. By constructing a controllable dataset and using output-token entropy to quantify unimodal uncertainty, the authors demonstrate a universal monotonic law: the probability of following a modality decreases as relative uncertainty increases, with a balance point capturing inherent bias. The study reveals an internal mechanism of oscillation: in ambiguous regions near the balance point, layer-wise predictions vacillate between text- and vision-supported answers, explaining externally observed indecision. These insights disentangle unimodal capabilities from stable preferences, offering a principled framework for understanding and improving how MLLMs resolve conflicting information and for designing more robust multimodal reasoning systems.

Abstract

Multimodal large language models (MLLMs) must resolve conflicts when different modalities provide contradictory information, a process we term modality following. Prior work measured this behavior only with coarse dataset-level statistics, overlooking the influence of a model's confidence in unimodal reasoning. In this paper, we introduce a new framework that decomposes modality following into two fundamental factors: relative reasoning uncertainty (the case-specific confidence gap between unimodal predictions) and inherent modality preference (a model's stable bias when uncertainties are balanced). To validate this framework, we construct a controllable dataset that systematically varies the reasoning difficulty of visual and textual inputs. Using entropy as a fine-grained uncertainty metric, we uncover a universal law: the probability of following a modality decreases monotonically as its relative uncertainty increases. We call the relative difficulty level at which the model follows both modalities with comparable probability the balance point; it serves as a practical indicator of the model's inherent preference. Unlike traditional macro-level ratios, this measure offers a more principled and less confounded way to characterize modality bias, disentangling it from unimodal capabilities and dataset artifacts. Further, by probing layer-wise predictions, we reveal the internal mechanism of oscillation: in ambiguous regions near the balance point, models vacillate between modalities across layers, explaining externally observed indecision. Together, these findings establish relative uncertainty and inherent preference as the two governing principles of modality following, offering both a quantitative framework and mechanistic insight into how MLLMs resolve conflicting information.
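The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the normalization of the relative uncertainty (signed entropy gap divided by the mean entropy), the sign convention (positive means text is the more uncertain modality), and the linear interpolation used to locate the balance point are all assumptions made for illustration, since this section does not give the paper's exact definitions.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete answer distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def relative_uncertainty(h_text, h_vision, eps=1e-12):
    """Assumed normalization: signed entropy gap divided by the mean entropy.

    Positive values mean the text input is the more uncertain modality.
    """
    return (h_text - h_vision) / ((h_text + h_vision) / 2 + eps)


def balance_point(curve):
    """Estimate where the text-following ratio crosses 0.5.

    curve: list of (delta_h_rel, text_follow_ratio) pairs sorted by delta_h_rel.
    Returns the linearly interpolated delta_h_rel at the crossing,
    or None if the ratio never crosses 0.5.
    """
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        crosses = (y0 - 0.5) * (y1 - 0.5) <= 0
        if crosses and y0 != y1:
            return x0 + (0.5 - y0) * (x1 - x0) / (y1 - y0)
    return None


# Hypothetical unimodal answer distributions for one conflicting sample.
h_t = entropy([0.4, 0.3, 0.3])    # text-only prediction is uncertain
h_v = entropy([0.9, 0.05, 0.05])  # vision-only prediction is confident
delta = relative_uncertainty(h_t, h_v)

# Hypothetical binned curve: text-following ratio per relative-uncertainty bin.
bp = balance_point([(-1.0, 0.9), (0.0, 0.7), (1.0, 0.3)])
```

Under this sign convention, a balance point above zero would indicate that the model still follows text half of the time even when text is the more uncertain input, that is, an inherent preference for text; a balance point below zero would indicate the opposite.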

Paper Structure

This paper contains 32 sections, 4 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Overview of the analytical framework. (a) We create inputs with independently controllable visual ($d_v$) and textual ($d_t$) difficulty. (b) We measure the model's perceived uncertainty for each modality via output entropy ($H_v$, $H_t$). (c) We then use the relative uncertainty ($\Delta H_{\text{rel}}$) to analyze the model's choice when faced with a conflict.
  • Figure 2: Unimodal Entropy Trends Across Difficulty Tiers. Average unimodal entropy for text (left) and vision (right) as a function of our designed difficulty tiers. Across all models, entropy consistently increases with difficulty, validating its use as a proxy for model-perceived uncertainty and revealing differences in model capabilities.
  • Figure 3: Macro-level modality-following ratios and relative uncertainty distributions of model performance on the dataset.
  • Figure 4: The relationship between relative unimodal uncertainty ($\Delta H_{\text{rel}}$, x-axis) and the probability of following the text modality (Text Preference Ratio, y-axis) for various models.
  • Figure 5: A comparison of the average number of concept oscillations for different models. Across all models, the number of oscillations is significantly higher in the ambiguous region (patterned bars) than in the clear region (solid bars).
  • ...and 5 more figures
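
The oscillation count summarized in Figure 5 can be sketched as follows. This is a hedged illustration, not the paper's procedure: it assumes that a top prediction has already been decoded at every layer (for example with a logit-lens-style projection, which is an assumption here), and the function name `count_oscillations` and its arguments are hypothetical.

```python
def count_oscillations(layer_preds, text_answer, vision_answer):
    """Count switches between the text- and vision-supported answers.

    layer_preds: top prediction decoded at each layer, ordered from the
    first layer to the last. Predictions that match neither candidate
    answer are ignored, so only genuine flips between modalities count.
    """
    relevant = [p for p in layer_preds if p in (text_answer, vision_answer)]
    return sum(1 for prev, cur in zip(relevant, relevant[1:]) if prev != cur)


# Hypothetical per-layer predictions for one conflicting sample in which
# the text supports "red" and the image supports "blue".
preds = ["the", "red", "red", "blue", "red", "a", "blue"]
n_switches = count_oscillations(preds, text_answer="red", vision_answer="blue")
```

Averaging this count separately over samples near the balance point and over samples far from it would give the kind of ambiguous-versus-clear comparison that the Figure 5 caption describes.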