BBQ-V: Benchmarking Visual Stereotype Bias in Large Multimodal Models

Vishal Narnaware, Ashmal Vayani, Rohit Gupta, Sirnam Swetha, Mubarak Shah

TL;DR

BBQ-V introduces a real-image, open-ended bias benchmark for large multimodal models, spanning nine social-bias domains across 50 sub-domains with over 14k visually grounded VQA items. It leverages an LLM-as-judge framework to assess fairness, stereotype reliance, prior bias, ambiguity, and grounding fidelity, supplemented by disambiguated accuracy measurements and human validation. The study reveals that incorporating visual input tends to amplify bias relative to text-only baselines, and that model scaling improves bias robustness, though ambiguity handling remains challenging. It further demonstrates the superiority of real-world imagery over synthetic data for reliable bias measurement, and provides public data and code to advance fairer, more transparent vision-language systems.

Abstract

Stereotype biases in Large Multimodal Models (LMMs) perpetuate harmful societal prejudices, undermining the fairness and equity of AI applications. As LMMs grow increasingly influential, addressing and mitigating inherent biases related to stereotypes, harmful generations, and ambiguous assumptions in real-world scenarios has become essential. However, existing datasets for evaluating stereotype biases in LMMs often lack diversity, rely on synthetic images, and are limited to single-actor images, leaving a gap in bias evaluation for real-world visual contexts. To address this gap, we introduce BBQ-Vision (BBQ-V), the most comprehensive framework for assessing stereotype biases across nine diverse categories and 50 sub-categories with real, multi-actor images. The BBQ-V benchmark contains 14,144 image-question pairs and rigorously evaluates LMMs through carefully curated, visually grounded scenarios, challenging them to reason accurately about visual stereotypes. It offers a robust evaluation framework featuring real-world visual samples, image variations, and open-ended question formats. BBQ-V enables a precise and nuanced assessment of a model's reasoning capabilities across varying levels of difficulty. Through rigorous testing of 19 state-of-the-art open-source (general-purpose and reasoning) and closed-source LMMs, we show that these top-performing models are often biased on several social stereotypes, and demonstrate that thinking models induce more bias in their reasoning chains. This benchmark represents a significant step toward fostering fairness in AI systems and reducing harmful biases, laying the groundwork for more equitable and socially responsible LMMs. Our dataset and evaluation code are publicly available.

Paper Structure

This paper contains 47 sections, 25 figures, 10 tables.

Figures (25)

  • Figure 1: The BBQ-V benchmark includes nine diverse domains and 50 sub-domains to rigorously assess the performance of LMMs in visually grounded stereotypical scenarios. BBQ-V comprises over 14.1k carefully curated real-world, multi-actor VQA pairs.
  • Figure 2: BBQ-V Data Curation Pipeline: Our benchmark incorporates ambiguous contexts and bias-probing questions from the BBQ dataset (Parrish et al., 2021). The ambiguous text context is passed to a Visual Query Generator (VQG), which simplifies it into a search-friendly query used to retrieve real-world images from the web. Retrieved images are filtered in three stages: (1) PaddleOCR is used to eliminate text-heavy images; (2) semantic alignment is verified using CLIP, Qwen2.5-VL, and GPT-4o-mini to ensure the image matches the simplified context; and (3) synthetic and cartoon-like images are removed using GPT-4o-mini. A Visual Information Remover (VIR) anonymizes text references to prevent explicit leakage. The processed visual content is first blurred to remove PIDs (e.g., faces, watermarks) and then paired with the original bias-probing question to construct the multimodal bias evaluation benchmark.
  • Figure 3: We present qualitative examples from proprietary (top row), open-source (middle row), and thinking models (bottom row), showcasing failure cases across various stereotype categories in BBQ-V. Models often rely on stereotypical associations to give definitive responses. For instance, Qwen3-VL-Thinking (bottom-left) attributes household responsibility to a man because of his traditional attire, and Gemini-2.0-flash (top-right) assumes that a secretary is often female; both reflect bias-driven reasoning rather than grounded inference. These examples show how current LMMs tend to amplify social stereotypes when interpreting ambiguous scenarios.
  • Figure 4: The figure illustrates the bias difference between LMMs and their corresponding LLM counterparts, showing that LMMs exhibit higher bias than their base LLMs.
  • Figure 5: Impact of model scaling on the results. The figure shows scaling results for various LMM families on individual stereotype bias categories. GPT-4o variants exhibit the highest scores as model scale increases.
  • ...and 20 more figures
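
The three-stage image filter described in the Figure 2 caption can be pictured as a chain of pass/fail checks applied to each retrieved image. The sketch below is only an illustration of that control flow, not the authors' implementation: the `Candidate` fields, the two thresholds, and the function names are hypothetical, and the scores stand in for the outputs of the OCR step, the semantic-alignment models, and the synthetic-image check named in the caption.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Candidate:
    """A retrieved image with precomputed scores (all fields are hypothetical)."""
    url: str
    text_ratio: float   # assumed: fraction of the image covered by detected text
    alignment: float    # assumed: image-to-context similarity score in [0, 1]
    is_synthetic: bool  # assumed: flag from a synthetic/cartoon check


# A stage returns True when the candidate should be kept.
Stage = Callable[[Candidate], bool]


def make_stages(max_text_ratio: float = 0.2, min_alignment: float = 0.5) -> List[Stage]:
    """Build the three checks in caption order; both thresholds are illustrative."""
    return [
        lambda c: c.text_ratio <= max_text_ratio,  # (1) drop text-heavy images
        lambda c: c.alignment >= min_alignment,    # (2) keep context-aligned images
        lambda c: not c.is_synthetic,              # (3) drop synthetic/cartoon images
    ]


def filter_candidates(candidates: Iterable[Candidate], stages: List[Stage]) -> List[Candidate]:
    """Keep only candidates that pass every stage; all() stops at the first failure."""
    return [c for c in candidates if all(stage(c) for stage in stages)]


candidates = [
    Candidate("a", text_ratio=0.05, alignment=0.8, is_synthetic=False),
    Candidate("b", text_ratio=0.60, alignment=0.9, is_synthetic=False),
    Candidate("c", text_ratio=0.05, alignment=0.3, is_synthetic=False),
    Candidate("d", text_ratio=0.00, alignment=0.9, is_synthetic=True),
]
kept = filter_candidates(candidates, make_stages())
kept_urls = [c.url for c in kept]
```

Ordering the cheap check first and the model-based checks later means a failing image is rejected before the costlier stages run, which is one plausible reason to arrange such a filter in this order.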