Vision Language Models are Biased

An Vo, Khai-Nguyen Nguyen, Mohammad Reza Taesiri, Vy Tuong Dang, Anh Totti Nguyen, Daeyoung Kim

TL;DR

VLMBias is a neutral-prompt, counterfactual benchmark that quantifies biases in vision-language models across seven domains, revealing that memorized priors and background cues dominate counting and identification tasks. The bias persists across multiple models, including reasoning-enabled variants; removing backgrounds and enhancing localization help, but the gains are modest and conditional on tool use. The work highlights the limitations of current VLMs in visual reasoning under counterfactuals, underscores the value of bias-rate metrics, and points to promising directions such as localization tools and targeted debiasing to improve reliability in multimodal reasoning.

Abstract

Large language models (LLMs) memorize a vast amount of prior knowledge from the Internet that helps them on downstream tasks but may also notoriously sway their outputs towards wrong or biased answers. In this work, we test how knowledge about popular subjects hurts the accuracy of vision language models (VLMs) on standard, objective visual tasks of counting and identification. We find that state-of-the-art VLMs are strongly biased (e.g., unable to recognize that a 4th stripe has been added to a 3-stripe Adidas logo), scoring an average of 17.05% accuracy in counting (e.g., counting stripes in an Adidas-like logo) across 7 diverse domains ranging from animals, logos, chess, board games, and optical illusions to patterned grids. Removing image backgrounds nearly doubles accuracy (21.09 percentage points), revealing that contextual visual cues trigger these biased responses. Further analysis of VLMs' reasoning patterns shows that counting accuracy initially rises with thinking tokens, reaching ~40%, before declining with excessive reasoning. Our work presents an interesting failure mode in VLMs and a human-supervised automated framework for testing VLM biases. Code and data are available at: vlmsarebiased.github.io.
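
The two quantities reported for counterfactual images, accuracy and the share of answers that match a predefined biased choice, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the `Trial` record, the `score` function, and the exact-match normalization are assumptions made for illustration only, and the example answers are made up.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One question asked about one counterfactual image (hypothetical record)."""
    prediction: str     # answer returned by the model
    ground_truth: str   # correct answer for the counterfactual image
    biased_answer: str  # answer suggested by prior knowledge, e.g. "3" stripes


def _normalize(answer: str) -> str:
    # Assumed normalization: ignore case and surrounding whitespace.
    return answer.strip().lower()


def score(trials: list[Trial]) -> dict[str, float]:
    """Return accuracy and bias rate, both in percent."""
    if not trials:
        raise ValueError("at least one trial is required")
    n = len(trials)
    correct = sum(
        _normalize(t.prediction) == _normalize(t.ground_truth) for t in trials
    )
    biased = sum(
        _normalize(t.prediction) == _normalize(t.biased_answer) for t in trials
    )
    return {"accuracy": 100.0 * correct / n, "bias_rate": 100.0 * biased / n}


# Made-up answers to "How many stripes?" on a 4-stripe Adidas-like logo.
example = [
    Trial(prediction="3", ground_truth="4", biased_answer="3"),
    Trial(prediction="4", ground_truth="4", biased_answer="3"),
    Trial(prediction="3", ground_truth="4", biased_answer="3"),
    Trial(prediction="5", ground_truth="4", biased_answer="3"),
]
result = score(example)
print(result)
```

On a counterfactual image the ground truth and the biased answer differ by construction, so an answer can be wrong without being biased (the "5" above). That is why the two rates need not sum to 100%.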

Paper Structure

This paper contains 83 sections, 47 figures, 35 tables.

Figures (47)

  • Figure 1: VLMs fail on 6 counting tasks (a–e & g) and one low-level vision task (f).
  • Figure 2: Given a subject (e.g., Adidas logo), we first confirm that all VLMs have sufficient knowledge about the subject via ID and counting sanity-check questions (a). Then, we test VLMs on the counterfactual image (b) and report their accuracy on the counting questions (Q1 & Q2) and a Y/N identification question (Q3). For all tasks, we test the hypothesis that the visual bias cues in the background (c) may be so strong that they cause VLMs to ignore the anomalous details and default to biased answers.
  • Figure 3: VLMs fail to detect subtle changes in counterfactuals (CF) and default to biased answers.
  • Figure 4: On the counterfactual images of VLMBias, five VLMs output answers that match the biased choices we predefine for each question 75.70% of the time.
  • Figure 5: VLMs perform poorly at counting elements on counterfactual images across three domains, often defaulting to the biased answers.
  • ...and 42 more figures