Evaluating and Steering Modality Preferences in Multimodal Large Language Model
Yu Zhang, Jinlong Ma, Yongshuai Hou, Xuefeng Bai, Kehai Chen, Yang Xiang, Jun Yu, Min Zhang
TL;DR
This paper investigates modality preference in multimodal large language models (MLLMs) using a controlled conflict-centric benchmark (MC²) to quantify whether models favor vision or text when faced with conflicting multimodal evidence. It finds a pervasive bias toward text across most models and shows that the direction of preference is encoded in latent representations, enabling a training-free probing and steering method based on representation engineering. The proposed probe-and-steer framework identifies a latent direction of modality preference and injects a scaled version of this direction during decoding to actively bias model outputs toward a chosen modality, without any fine-tuning. Empirical results demonstrate improved performance on downstream tasks such as multimodal machine translation and visual understanding, and the approach generalizes across multiple MLLMs, offering practical tools for reducing hallucinations and tailoring multimodal reasoning.
Abstract
Multimodal large language models (MLLMs) have achieved remarkable performance on complex tasks with multimodal context. However, it is still understudied whether they exhibit modality preference when processing multimodal contexts. To study this question, we first build the MC² benchmark under controlled evidence conflict scenarios to systematically evaluate modality preference, which is the tendency to favor one modality over another when making decisions based on multimodal conflicting evidence. Our extensive evaluation reveals that all 18 tested MLLMs generally demonstrate clear modality bias, and modality preference can be influenced by external interventions. An in-depth analysis reveals that the preference direction can be captured within the latent representations of MLLMs. Built on this, we propose a probing and steering method based on representation engineering to explicitly control modality preference without additional fine-tuning or carefully crafted prompts. Our method effectively amplifies modality preference toward a desired direction and applies to downstream tasks such as hallucination mitigation and multimodal machine translation, yielding promising improvements.
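The probe-and-steer idea can be illustrated with a toy sketch. The abstract does not specify how the preference direction is extracted, so the version below assumes a simple difference-of-means probe over hidden states, a common representation-engineering baseline that may differ from the authors' procedure. All names (`preference_direction`, `steer`, `alpha`) and the 3-dimensional toy vectors are hypothetical and chosen only for illustration.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def preference_direction(vision_states, text_states):
    """Unit vector from the text-preferring mean to the vision-preferring mean.

    Assumption: a difference-of-means probe stands in for the paper's
    probing step, which is not detailed in the abstract.
    """
    mu_vision = mean_vector(vision_states)
    mu_text = mean_vector(text_states)
    diff = [v - t for v, t in zip(mu_vision, mu_text)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("group means coincide; no direction to probe")
    return [d / norm for d in diff]


def steer(hidden, direction, alpha):
    """Shift a hidden state by alpha along the probed direction.

    Positive alpha pushes toward vision, negative alpha toward text.
    """
    return [h + alpha * d for h, d in zip(hidden, direction)]


def projection(hidden, direction):
    """Scalar projection of a hidden state onto the direction."""
    return sum(h * d for h, d in zip(hidden, direction))


# Toy hidden states collected when the model followed vision vs. text evidence.
vision_states = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
text_states = [[-1.0, 0.0, 2.0], [-3.0, 0.0, 2.0]]

direction = preference_direction(vision_states, text_states)

hidden = [0.5, 1.0, -1.0]
steered_to_vision = steer(hidden, direction, alpha=2.0)
steered_to_text = steer(hidden, direction, alpha=-2.0)
```

In a real MLLM the same shift would be applied to the activations of a chosen layer at each decoding step, for example through a forward hook, and the scale `alpha` would need tuning so that the model's preference moves without degrading fluency.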
