Evaluating and Steering Modality Preferences in Multimodal Large Language Model

Yu Zhang, Jinlong Ma, Yongshuai Hou, Xuefeng Bai, Kehai Chen, Yang Xiang, Jun Yu, Min Zhang

TL;DR

This paper investigates modality preference in multimodal large language models (MLLMs) using a controlled, conflict-centric benchmark (MC²) that quantifies whether a model favors vision or text when the two modalities supply conflicting evidence. It finds a pervasive bias toward text across most models and shows that the direction of preference is encoded in latent representations, which enables a training-free probing and steering method based on representation engineering. The probe-and-steer framework identifies a latent direction of modality preference and injects a scaled version of that direction during decoding to bias model outputs toward a chosen modality, without any fine-tuning. Empirical results show improved performance on downstream tasks such as multimodal machine translation and visual understanding, and the approach generalizes across multiple MLLMs, offering practical tools for reducing hallucinations and tailoring multimodal reasoning.

Abstract

Multimodal large language models (MLLMs) have achieved remarkable performance on complex tasks with multimodal context. However, it is still understudied whether they exhibit modality preference when processing multimodal contexts. To study this question, we first build the MC² benchmark under controlled evidence conflict scenarios to systematically evaluate modality preference, which is the tendency to favor one modality over another when making decisions based on conflicting multimodal evidence. Our extensive evaluation reveals that all 18 tested MLLMs generally demonstrate clear modality bias, and that modality preference can be influenced by external interventions. An in-depth analysis reveals that the preference direction can be captured within the latent representations of MLLMs. Building on this, we propose a probing and steering method based on representation engineering to explicitly control modality preference without additional fine-tuning or carefully crafted prompts. Our method effectively amplifies modality preference toward a desired direction and applies to downstream tasks such as hallucination mitigation and multimodal machine translation, yielding promising improvements.
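
The probe-and-steer idea described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes the preference direction is estimated as the normalized difference of mean hidden states between vision-preferring and text-preferring runs (a common simplification in representation engineering), and the names `probe_direction`, `steer`, and `alpha` are hypothetical. The hidden states here are synthetic stand-ins for activations taken from one layer of a model.

```python
import numpy as np


def probe_direction(h_vision: np.ndarray, h_text: np.ndarray) -> np.ndarray:
    """Estimate a unit-length preference direction (assumed: difference of means).

    h_vision, h_text: arrays of shape (num_samples, hidden_dim) holding hidden
    states collected when the model followed the image or the text, respectively.
    """
    diff = h_vision.mean(axis=0) - h_text.mean(axis=0)
    return diff / np.linalg.norm(diff)


def steer(hidden: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Add the scaled direction to a hidden state (alpha > 0 pushes toward vision)."""
    return hidden + alpha * direction


# Synthetic data: vision-preferring states are shifted along axis 0.
rng = np.random.default_rng(0)
hidden_dim = 8
shift = np.zeros(hidden_dim)
shift[0] = 3.0
h_text = rng.normal(size=(32, hidden_dim))
h_vision = rng.normal(size=(32, hidden_dim)) + shift

direction = probe_direction(h_vision, h_text)

# Steering one hidden state; in a real model this would be applied at the
# chosen layer at every decoding step.
h = rng.normal(size=hidden_dim)
steered = steer(h, direction, alpha=4.0)
```

Because `direction` has unit length, steering moves the projection of the hidden state onto that direction by exactly `alpha`. A negative `alpha` would push toward the text side under the same assumption.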

Paper Structure

This paper contains 40 sections, 2 equations, 12 figures, 10 tables.

Figures (12)

  • Figure 2: Results of modality preference across different MLLMs. Left: Quantified scores for the Vision, Others, and Text modalities using $S_{vision}$, $S_{others}$, and $S_{text}$, as well as Vision Ratio. Right: Trends of Vision Ratio with respect to model parameter size across different MLLMs.
  • Figure 3: Analysis of modality preferences. Left: Trends of Vision Ratio and multimodal Attention Ratio across different models. Middle: Vision Ratio with respect to the proportion of multimodal conflict-context training data across different MLLMs. Right: Relationship between visual understanding ability, quantified as the average accuracy across seven widely used benchmarks, and the modality preference measured by Vision Ratio.
  • Figure 4: Analysis of modality preference under instruction guidance. Left: Adjustment of modality preference using instruction-guided control (Inst-vision vs. Inst-text). Middle: Representation shifts under instruction-guided interventions. Right: Layer-wise absolute difference and standard deviation of hidden states between different instructions.
  • Figure 5: Overall framework of the proposed method. Modality Preference Probing collects the neural activity, then computes and scales the direction of modality preference. Modality Preference Steering selects the target layer during the second inference pass and adds the scaled modality preference direction to the representation at that layer at each inference step.
  • Figure 6: Illustration of using modality context conflict pairs to investigate modality preference in activity recognition (Left) and counting tasks (Right). The highlighted areas indicate the points of conflict between visual and textual contexts.
  • ...and 7 more figures
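
The Figure 2 caption refers to the scores $S_{vision}$, $S_{others}$, and $S_{text}$ and to a Vision Ratio, but this excerpt does not define the ratio. The sketch below therefore rests on an assumption: it treats Vision Ratio as the share of vision-aligned answers among answers that follow either vision or text, ignoring "others". The function name `vision_ratio` and the example counts are hypothetical and are not results from the paper.

```python
def vision_ratio(s_vision: float, s_text: float) -> float:
    """Share of vision-aligned answers among vision- or text-aligned answers.

    Assumed definition: S_vision / (S_vision + S_text). The paper may define
    the quantity differently.
    """
    total = s_vision + s_text
    if total <= 0:
        raise ValueError("need at least one vision- or text-aligned answer")
    return s_vision / total


# Hypothetical counts: 30 answers follow the image, 70 follow the text.
example = vision_ratio(30, 70)
```

Under this reading, a value below 0.5 indicates a text preference and a value above 0.5 indicates a vision preference.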