Unveiling Intrinsic Text Bias in Multimodal Large Language Models through Attention Key-Space Analysis

Xinhan Zheng, Huyu Wu, Xueting Wang, Haiyun Jiang

TL;DR

The paper investigates why multimodal LLMs preferentially rely on textual inputs, proposing that the bias arises from intrinsic architecture rather than data factors. By extracting Key Vectors from decoder layers in LLaVA-1.5-7B and Qwen2.5-VL-7B and using PCA+$t$-SNE alongside MMD and Jensen-Shannon divergence, the authors demonstrate that Visual Keys and Text Keys occupy distinct subspaces in the attention space, with inter-modal divergence far exceeding intra-modal variation. Findings show that cross-modality representations form separated manifolds, and the bias is strongest in simpler adapters (LLaVA) while persisting in Qwen, indicating an architectural cause. The work suggests remediation through cross-modal K-space alignment rather than solely data balancing, informing the design of more balanced, interpretable multimodal systems that can reason effectively from visual evidence.

Abstract

Multimodal large language models (MLLMs) exhibit a pronounced preference for textual inputs when processing vision-language data, limiting their ability to reason effectively from visual evidence. Unlike prior studies that attribute this text bias to external factors such as data imbalance or instruction tuning, we propose that the bias originates from the model's internal architecture. Specifically, we hypothesize that visual key vectors (Visual Keys) are out-of-distribution (OOD) relative to the text key space learned during language-only pretraining. Consequently, these visual keys receive systematically lower similarity scores during attention computation, leading to their under-utilization in the context representation. To validate this hypothesis, we extract key vectors from LLaVA and Qwen2.5-VL and analyze their distributional structures using qualitative (t-SNE) and quantitative (Jensen-Shannon divergence) methods. The results provide direct evidence that visual and textual keys occupy markedly distinct subspaces within the attention space. The inter-modal divergence is statistically significant, exceeding intra-modal variation by several orders of magnitude. These findings reveal that text bias arises from an intrinsic misalignment within the attention key space rather than solely from external data factors.
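The mechanism hypothesized above (keys lying outside the region that queries are tuned to receive lower similarity scores, hence less attention) can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the authors' code: the dimensions, the two cluster centers, the noise scale, and the helper name `attention_mass` are all hypothetical, and the keys are synthetic Gaussians rather than keys extracted from LLaVA or Qwen2.5-VL.

```python
import numpy as np

def attention_mass(query, text_keys, visual_keys):
    """Return total softmax attention mass on text keys and on visual keys."""
    keys = np.concatenate([text_keys, visual_keys], axis=0)
    # Scaled dot-product similarity between the single query and every key.
    scores = keys @ query / np.sqrt(query.shape[0])
    scores = scores - scores.max()  # subtract max for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    n_text = text_keys.shape[0]
    return float(weights[:n_text].sum()), float(weights[n_text:].sum())

# Synthetic setup (assumption): text keys cluster around one center,
# visual keys around a different, orthogonal center.
rng = np.random.default_rng(0)
d, n = 64, 32
text_center = np.zeros(d)
text_center[0] = 4.0
visual_center = np.zeros(d)
visual_center[1] = 4.0
text_keys = text_center + 0.5 * rng.standard_normal((n, d))
visual_keys = visual_center + 0.5 * rng.standard_normal((n, d))

# A query aligned with the text cluster, standing in for a query
# shaped by language-only pretraining.
query = text_center.copy()

text_mass, visual_mass = attention_mass(query, text_keys, visual_keys)
print(f"text mass: {text_mass:.3f}, visual mass: {visual_mass:.3f}")
```

With equal numbers of text and visual keys, most of the attention mass in this toy setup falls on the text keys, simply because the query is aligned with their cluster. The example shows only the geometric effect; it says nothing about how large the effect is in a real model.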

Paper Structure

This paper contains 10 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: t-SNE projections reorganized into a $2 \times 6$ matrix. The top row shows LLaVA-1.5-7B results, and the bottom row shows Qwen2.5-VL-7B results. Columns 1-3 correspond to MMBench-CN (Early, Middle, Late layers), and Columns 4-6 correspond to MMMU (Early, Middle, Late layers).
  • Figure 2: Distribution Differences of Inter-Modal K-Vectors based on MMD and JS Divergence. Results are aggregated across all selected layers and two benchmarks (MMBench-CN, MMMU) for both LLaVA and Qwen models. The significant separation between the 'Image V.S. Text' boxes and the control groups ('Image V.S. Image', 'Text V.S. Text') confirms the strong geometric modality gap.
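The comparison summarized in Figure 2 (inter-modal versus intra-modal distances under MMD and Jensen-Shannon divergence) can be sketched as follows. The source does not state the kernel, bandwidth, or density estimator used, so everything here is an assumption: an RBF kernel with a median-heuristic bandwidth for MMD, and JS divergence computed on histograms of a 1-D projection onto the top principal component. The function names and the synthetic Gaussian "keys" are hypothetical stand-ins for real extracted key vectors.

```python
import numpy as np

def sq_dists(a, b):
    """Pairwise squared Euclidean distances between rows of a and rows of b."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)

def mmd2_rbf(x, y):
    """Biased estimate of squared MMD with an RBF kernel (median-heuristic bandwidth)."""
    pooled = np.concatenate([x, y], axis=0)
    sigma2 = np.median(sq_dists(pooled, pooled))  # assumption: median heuristic

    def kernel(a, b):
        return np.exp(-sq_dists(a, b) / (2.0 * sigma2))

    return float(kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean())

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, so in [0, 1]) between two histograms."""
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return float(0.5 * kl(p, m) + 0.5 * kl(q, m))

def projected_js(x, y, bins=20):
    """JS divergence of x and y after projecting onto the pooled top principal component."""
    pooled = np.concatenate([x, y], axis=0)
    centered = pooled - pooled.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[0]
    edges = np.histogram_bin_edges(proj, bins=bins)  # shared bins for both groups
    p, _ = np.histogram(proj[: len(x)], bins=edges)
    q, _ = np.histogram(proj[len(x):], bins=edges)
    return js_divergence(p, q)

# Synthetic stand-ins (assumption): "text" keys near the origin, "image" keys shifted.
rng = np.random.default_rng(0)
text = rng.standard_normal((200, 16))
image = rng.standard_normal((200, 16)) + 3.0

inter_mmd = mmd2_rbf(image, text)             # Image vs. Text
intra_mmd = mmd2_rbf(text[:100], text[100:])  # Text vs. Text control
inter_js = projected_js(image, text)
intra_js = projected_js(text[:100], text[100:])
print(f"MMD^2 inter={inter_mmd:.4f} intra={intra_mmd:.4f}")
print(f"JS    inter={inter_js:.4f} intra={intra_js:.4f}")
```

The control comparison splits one modality in half, which mirrors the role of the 'Text V.S. Text' and 'Image V.S. Image' groups in the figure. On real keys, one would replace `text` and `image` with arrays of key vectors gathered per layer.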