Mixed Signals: Decoding VLMs' Reasoning and Underlying Bias in Vision-Language Conflict

Pouya Pezeshkpour; Moin Aminnaseri; Estevam Hruschka

TL;DR

This work examines how vision-language models (VLMs) reason under conflicting visual and textual cues by constructing five benchmarks derived from VSR and IsoBench to quantify modality bias across math, science, and visual description tasks. It evaluates five state-of-the-art VLMs on accuracy and F1, introduces a bias metric $B$ defined as the difference between the percentages of image-favored and text-favored responses, and tests three mitigation strategies: Verbalized Mitigation, Chain-of-Thought, and Decomposed Mitigation. The results show that bias depends on task difficulty and model scale, with simple queries favoring text and more complex ones shifting toward images; mitigation effectiveness is task- and model-dependent. These findings highlight the need for task- and model-aware approaches to make multimodal reasoning more reliable in real-world applications.

Abstract

Vision-language models (VLMs) have demonstrated impressive performance by effectively integrating visual and textual information to solve complex tasks. However, it is not clear how these models reason over the visual and textual data together, nor how the flow of information between modalities is structured. In this paper, we examine how VLMs reason by analyzing their biases when confronted with scenarios that present conflicting image and text cues, a common occurrence in real-world applications. To uncover the extent and nature of these biases, we build upon existing benchmarks to create five datasets containing mismatched image-text pairs, covering topics in mathematics, science, and visual descriptions. Our analysis shows that VLMs favor text in simpler queries but shift toward images as query complexity increases. This bias correlates with model scale, with the difference between the percentage of image- and text-preferred responses ranging from +56.8% (image favored) to -74.4% (text favored), depending on the task and model. In addition, we explore three mitigation strategies: simple prompt modifications, modifications that explicitly instruct models on how to handle conflicting information (akin to chain-of-thought prompting), and a task decomposition strategy that analyzes each modality separately before combining their results. Our findings indicate that the effectiveness of these strategies in identifying and mitigating bias varies significantly and is closely linked to the model's overall performance on the task and the specific modality in question.
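
The bias measure mentioned above (the percentage of image-preferred responses minus the percentage of text-preferred responses) can be illustrated with a small sketch. It assumes each model response on a conflicting image-text pair has already been labeled as following the image, the text, or neither. The function name `bias_score` and the label strings are hypothetical, and the paper's exact definition of $B$ may differ, for example in how unclassifiable responses are counted.

```python
from collections import Counter


def bias_score(labels):
    """Percentage of image-following responses minus percentage of text-following ones.

    `labels` is a list of strings, one per response; the values "image" and
    "text" are counted, and anything else (e.g. "other") only contributes to
    the total. This labeling scheme is an assumption made for illustration.
    """
    if not labels:
        raise ValueError("labels must be non-empty")
    counts = Counter(labels)
    total = len(labels)
    image_pct = 100.0 * counts["image"] / total
    text_pct = 100.0 * counts["text"] / total
    return image_pct - text_pct


if __name__ == "__main__":
    # Toy data: 6 responses follow the image, 3 follow the text, 1 follows neither.
    toy_labels = ["image"] * 6 + ["text"] * 3 + ["other"]
    print(bias_score(toy_labels))
```

Under this sign convention, a positive score means the model sides with the image more often and a negative score means it sides with the text, matching the reported range from +56.8% (image favored) to -74.4% (text favored).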
