VIGNETTE: Socially Grounded Bias Evaluation for Vision-Language Models

Chahat Raj, Bowen Wei, Aylin Caliskan, Antonios Anastasopoulos, Ziwei Zhu

TL;DR

VIGNETTE introduces a socially grounded, VQA-based bias evaluation for vision-language models, addressing gaps in prior bias studies by using activity-grounded, paired-scenario images across eight identity dimensions and 75 activities. The benchmark comprises 30M+ synthetic images and four QA paradigms (factuality, perception, stereotyping, and decision making) to analyze trait-level inferences and downstream decisions. The authors propose four bias metrics and perform large-scale analysis across three state-of-the-art VLMs, revealing structured, context-dependent biases that vary by identity, activity, and model architecture, with cross-model differences in factual grounding and perception. They release data, prompts, and code to enable broad, transparent bias research and to inform responsible VLM design, while acknowledging limitations of synthetic data, visual-only identities, and generalization. Overall, VIGNETTE provides a multi-faceted framework to quantify and understand socially grounded biases in multimodal models, with implications for fairness, interpretability, and model development.

Abstract

While bias in large language models (LLMs) is well studied, similar concerns in vision-language models (VLMs) have received comparatively less attention. Existing VLM bias studies often focus on portrait-style images and gender-occupation associations, overlooking broader and more complex social stereotypes and their implied harm. This work introduces VIGNETTE, a large-scale VQA benchmark with 30M+ images for evaluating bias in VLMs through a question-answering framework spanning four directions: factuality, perception, stereotyping, and decision making. Going beyond narrowly centered studies, we assess how VLMs interpret identities in contextualized settings, revealing how models make trait and capability assumptions and exhibit patterns of discrimination. Drawing from social psychology, we examine how VLMs connect visual identity cues to trait and role-based inferences, encoding social hierarchies through biased selections. Our findings uncover subtle, multifaceted, and surprising stereotypical patterns, offering insights into how VLMs construct social meaning from inputs.

Paper Structure

This paper contains 45 sections, 4 equations, 18 figures, 5 tables.

Figures (18)

  • Figure 1: Proposed VQA framework with 4 paradigms: factuality, perception, stereotype, and decision-making.
  • Figure 2: Pairwise comparison on struggle across Ability. For instance, a blind person, when paired against a person with glasses, struggles more.
  • Figure 3: Asians observe consistent trends (left) vs. Europeans observe conflicting trends (right).
  • Figure 4: Model comparisons show variability across factuality and stereotype, but are consistently biased for perception and decision-making. ($\uparrow$ = advantaged)
  • Figure 5: Models do not share the same bias trends. Perception shows higher bias across models; stereotype scores remain moderate. ($\uparrow$ = advantaged)
  • ...and 13 more figures