BBQ-V: Benchmarking Visual Stereotype Bias in Large Multimodal Models
Vishal Narnaware, Ashmal Vayani, Rohit Gupta, Sirnam Swetha, Mubarak Shah
TL;DR
BBQ-V introduces a real-image, open-ended bias benchmark for large multimodal models, spanning nine social-bias domains across 50 sub-domains with over 14k visually grounded VQA items. It leverages an LLM-as-judge framework to assess fairness, stereotype reliance, prior bias, ambiguity, and grounding fidelity, supplemented by disambiguated accuracy measurements and human validation. The study reveals that incorporating visual input tends to amplify bias relative to text-only baselines, and that model scaling improves bias robustness, though ambiguity handling remains challenging. It further demonstrates the superiority of real-world imagery over synthetic data for reliable bias measurement, and provides public data and code to advance fairer, more transparent vision-language systems.
Abstract
Stereotype biases in Large Multimodal Models (LMMs) perpetuate harmful societal prejudices, undermining the fairness and equity of AI applications. As LMMs grow increasingly influential, mitigating their inherent biases related to stereotypes, harmful generations, and ambiguous assumptions in real-world scenarios has become essential. However, existing datasets for evaluating stereotype biases in LMMs often lack diversity, rely on synthetic images, and feature single-actor images, leaving a gap in bias evaluation for real-world visual contexts. To address this gap, we introduce BBQ-Vision (BBQ-V), the most comprehensive framework for assessing stereotype biases across nine diverse categories and 50 sub-categories with real and multi-actor images. The BBQ-V benchmark contains 14,144 image-question pairs and rigorously evaluates LMMs through carefully curated, visually grounded scenarios, challenging them to reason accurately about visual stereotypes. It offers a robust evaluation framework featuring real-world visual samples, image variations, and open-ended question formats, enabling a precise and nuanced assessment of a model's reasoning capabilities across varying levels of difficulty. Through rigorous testing of 19 state-of-the-art open-source (general-purpose and reasoning) and closed-source LMMs, we show that these top-performing models are often biased on several social stereotypes, and that thinking models exhibit more bias in their reasoning chains. This benchmark represents a significant step toward fostering fairness in AI systems and reducing harmful biases, laying the groundwork for more equitable and socially responsible LMMs. Our dataset and evaluation code are publicly available.
