Words or Vision: Do Vision-Language Models Have Blind Faith in Text?

Ailin Deng, Tri Cao, Zhirui Chen, Bryan Hooi

TL;DR

This work reveals a robust bias in vision-language systems where textual information is privileged over conflicting visual data, a phenomenon termed 'blind faith in text.' By introducing controlled text variations across four vision-centric tasks and evaluating ten diverse VLMs, the study shows that corrupted text can severely degrade performance and pose safety risks, and that text more relevant to the query strengthens the bias. The authors dissect influencing factors (instructions, model size, text relevance, token order, uni-modal certainty) and demonstrate that supervised fine-tuning with text augmentation reduces the bias, though it does not eliminate it. A theoretical analysis attributes the bias to an imbalance in training data, which is predominantly pure text rather than multi-modal, underscoring the need for balanced, cross-modal training to improve robustness. Collectively, the results call for cautious deployment of cross-modal systems in real-world contexts and motivate further research into balanced multimodal pretraining and robust interaction mechanisms.

Abstract

Vision-Language Models (VLMs) excel in integrating visual and textual information for vision-centric tasks, but their handling of inconsistencies between modalities is underexplored. We investigate VLMs' modality preferences when faced with visual data and varied textual inputs in vision-centered settings. By introducing textual variations to four vision-centric tasks and evaluating ten VLMs, we discover a "blind faith in text" phenomenon: VLMs disproportionately trust textual data over visual data when inconsistencies arise, leading to significant performance drops under corrupted text and raising safety concerns. We analyze factors influencing this text bias, including instruction prompts, language model size, text relevance, token order, and the interplay between visual and textual certainty. While certain factors, such as scaling up the language model size, slightly mitigate text bias, others like token order can exacerbate it due to positional biases inherited from language models. To address this issue, we explore supervised fine-tuning with text augmentation and demonstrate its effectiveness in reducing text bias. Additionally, we provide a theoretical analysis suggesting that the blind faith in text phenomenon may stem from an imbalance of pure text and multi-modal data during training. Our findings highlight the need for balanced training and careful consideration of modality interactions in VLMs to enhance their robustness and reliability in handling multi-modal data inconsistencies.

Paper Structure

This paper contains 51 sections, 4 theorems, 28 equations, 8 figures, 12 tables.

Key Result

Theorem 5.1

(Informal; simplified version of the main theorem) Under certain assumptions, with probability at least $1-\delta$ the expected loss under pure-text data $\mathbb{E}_{(X,Y)\sim \mathcal{D}^{\rm txt}} \left[l(f_{\rm vlm}(X;\hat{\theta}_{\rm ERM}),Y) \right]$ achieves and similarly the expected loss under multi-modal data $\mathbb{E}_{(X,Y)\sim \mathcal{D}^{\rm mul}} \left[l(f_{\rm vlm}(X;\hat{

Figures (8)

  • Figure 1: Illustration of the “Blind Faith in Text” phenomenon in Vision-Language Models (VLMs). These models demonstrate a strong tendency to trust textual data, even when it is inconsistent with the visual data or incorrect.
  • Figure 2: Prompt for generating matched and corrupted text given an image, the question and the ground-truth answer. We substitute {question} and {answer} with the specific sample.
  • Figure 3: Behaviors of different models when text is corrupted, matched, or irrelevant.
  • Figure 4: Text Preference Ratio (TPR) of all models under different text variations. Most models, especially open models, exhibit a high text preference bias when the textual information is relevant, even if it is incorrect. Among the proprietary models, Claude-Sonnet exhibits the strongest robustness to corrupted text.
  • Figure 5: The effect of different factors (prompting, language model size, text relevance) on text bias. Left: Instructional prompts influence modality preference only slightly; text preference drops from $16.8\%$ to $14.2\%$ with "Focus on Image" vs. "Focus on Text" in QwenVL-2-7B. Middle: Scaling the language models (7B, 13B, 34B) in LLaVA-NeXT models decreases text bias, but only marginally. Right: Increasing text relevance to the query with BM25 retrieval raises text bias.
  • ...and 3 more figures
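
The captions above refer to a Text Preference Ratio (TPR) without defining it in this summary. As a rough sketch only, one plausible reading is the fraction of conflicting samples on which the model's answer follows the text rather than the image. That definition is an assumption made here for illustration, not the paper's stated formula, and the names `classify` and `text_preference_ratio` are hypothetical.

```python
from collections import Counter


def classify(pred, image_answer, text_answer):
    """Label one prediction on a sample where text and image disagree.

    Assumption: answers are short strings compared after trimming and
    lowercasing; the paper's actual matching procedure may differ.
    """
    p = pred.strip().lower()
    if p == image_answer.strip().lower():
        return "image"   # answer agrees with the visual ground truth
    if p == text_answer.strip().lower():
        return "text"    # answer agrees with the (corrupted) text
    return "other"       # answer agrees with neither


def text_preference_ratio(records):
    """Share of samples whose prediction follows the text (assumed definition).

    `records` is an iterable of (pred, image_answer, text_answer) tuples.
    """
    counts = Counter(classify(*r) for r in records)
    total = sum(counts.values())
    return counts["text"] / total if total else 0.0


# Toy data, invented for illustration only.
records = [
    ("cat", "cat", "dog"),    # follows the image
    ("dog", "cat", "dog"),    # follows the text
    ("Dog ", "cat", "dog"),   # follows the text (after normalization)
    ("bird", "cat", "dog"),   # follows neither
]
ratio = text_preference_ratio(records)
```

The sketch only makes sense on samples where the text-implied answer and the image-based answer differ; with matched text the two labels coincide and the ratio is not informative.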

Theorems & Definitions (6)

  • Theorem 5.1
  • Remark 5.2
  • Theorem A.5
  • Lemma A.6
  • Lemma A.7
  • Proof of the main theorem (thm:main)