Mixed Signals: Decoding VLMs' Reasoning and Underlying Bias in Vision-Language Conflict

Pouya Pezeshkpour, Moin Aminnaseri, Estevam Hruschka

TL;DR

This work examines how vision-language models reason under conflicting visual and textual cues by constructing five benchmarks derived from VSR and IsoBench to quantify modality bias across math, science, and visual description tasks. It evaluates five state-of-the-art VLMs on accuracy and F1, introduces a bias metric $B$ that captures the difference between the percentages of image-favored and text-favored responses, and tests three mitigation strategies: Verbalized Mitigation, Chain-of-Thought, and Decomposed Mitigation. The results show that bias depends on task difficulty and model scale, with simple queries favoring text and more complex ones shifting toward images; mitigation effectiveness is task- and model-dependent. These findings highlight the need for task- and model-aware approaches to make multimodal reasoning more reliable in real-world applications.

Abstract

Vision-language models (VLMs) have demonstrated impressive performance by effectively integrating visual and textual information to solve complex tasks. However, it is not clear how these models reason over the visual and textual data together, nor how the flow of information between modalities is structured. In this paper, we examine how VLMs reason by analyzing their biases when confronted with scenarios that present conflicting image and text cues, a common occurrence in real-world applications. To uncover the extent and nature of these biases, we build upon existing benchmarks to create five datasets containing mismatched image-text pairs, covering topics in mathematics, science, and visual descriptions. Our analysis shows that VLMs favor text in simpler queries but shift toward images as query complexity increases. This bias correlates with model scale, with the difference between the percentage of image- and text-preferred responses ranging from +56.8% (image favored) to -74.4% (text favored), depending on the task and model. In addition, we explore three mitigation strategies: simple prompt modifications, modifications that explicitly instruct models on how to handle conflicting information (akin to chain-of-thought prompting), and a task decomposition strategy that analyzes each modality separately before combining their results. Our findings indicate that the effectiveness of these strategies in identifying and mitigating bias varies significantly and is closely linked to the model's overall performance on the task and the specific modality in question.
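The bias measure described above (percentage of image-preferred responses minus percentage of text-preferred responses) can be illustrated with a minimal sketch. The function name `modality_bias`, the label strings, and the choice to divide by all responses (including those that follow neither modality) are assumptions made for illustration; the paper's exact normalization is not stated in this section.

```python
from collections import Counter


def modality_bias(labels):
    """Return the bias in percentage points.

    Positive values mean image-favored, negative values mean text-favored.
    Assumption: the denominator is the total number of responses, including
    responses that match neither modality (labelled "other" here).
    """
    if not labels:
        raise ValueError("labels must be non-empty")
    counts = Counter(labels)
    total = len(labels)
    image_pct = 100.0 * counts["image"] / total
    text_pct = 100.0 * counts["text"] / total
    return image_pct - text_pct


# Hypothetical toy data: 3 image-following, 5 text-following, 2 neither.
labels = ["image"] * 3 + ["text"] * 5 + ["other"] * 2
bias = modality_bias(labels)
print(f"B = {bias:+.1f}")  # B = -20.0 (text favored)
```

Under this convention the value ranges from -100 (every response follows the text) to +100 (every response follows the image), which matches the sign convention of the reported +56.8% and -74.4% extremes.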

Paper Structure

This paper contains 33 sections, 1 equation, 10 figures, 4 tables.

Figures (10)

  • Figure 1: We investigate VLMs' bias toward text versus image inputs when mismatches occur between the modalities. Our observations reveal that this bias heavily depends on the task's difficulty. For example, while the model relies on textual representations to compute the roots of a degree-2 polynomial, increasing the degree to 3 shifts the reliance more toward the visual representation of the function.
  • Figure 2: We investigate the impact of three mitigation strategies (Verbalized, CoT, and Decomposed) on identifying mismatches in the input modalities.
  • Figure 3: The distribution of VLMs' biases toward text versus image inputs.
  • Figure 4: The distribution of VLMs' biases toward text versus image inputs in the VSR-based dataset. We report the breakdown of performance by spatial relationship.
  • Figure 5: The accuracy of VLMs' internal perception of the simpler modality for solving the task is evaluated by comparing it to the actual modality each model relies on during problem solving.
  • ...and 5 more figures