Ask Me Again Differently: GRAS for Measuring Bias in Vision Language Models on Gender, Race, Age, and Skin Tone

Shaivi Malik, Hasnat Md Abdullah, Sriparna Saha, Amit Sheth

TL;DR

The paper proposes the GRAS Benchmark to quantify demographic biases in Vision-Language Models across gender, race, age, and skin tone using a scalable, demographically balanced dataset and five linguistically varied question templates. It introduces the GRAS Bias Score, an interpretable 0–100 metric that aggregates significant bias across 100 traits and four attributes, and evaluates it on five state-of-the-art VLMs. The results reveal pervasive, high levels of bias even in state-of-the-art models, and show that bias measurements are highly sensitive to question formulation, underscoring the need to probe bias with multiple formulations of each question. By releasing the dataset, prompts, and code, the work enables reproducible bias evaluation and lays groundwork for targeted mitigation in vision-language systems.

Abstract

As Vision Language Models (VLMs) become integral to real-world applications, understanding their demographic biases is critical. We introduce GRAS, a benchmark for uncovering demographic biases in VLMs across gender, race, age, and skin tone, offering the most diverse coverage to date. We further propose the GRAS Bias Score, an interpretable metric for quantifying bias. We benchmark five state-of-the-art VLMs and reveal concerning bias levels, with the least biased model attaining a GRAS Bias Score of only 2 out of 100. Our findings also reveal a methodological insight: evaluating bias in VLMs with visual question answering (VQA) requires considering multiple formulations of a question. Our code, data, and evaluation results are publicly available.

Paper Structure

This paper contains 18 sections, 1 equation, 52 figures, 9 tables.

Figures (52)

  • Figure 1: Qwen2.5-VL-3B-Instruct [Qwen2VL] gives different answers to the same image when asked five semantically equivalent questions, showing sensitivity to question formulation in VQA.
  • Figure 2: GRAS Benchmark: Overview of our benchmark for evaluating bias in vision language models across demographic attributes including gender, race, age, and skin tone. The GRAS Image Dataset consists of 5,010 images, representing 10 skin tone groups, 7 racial groups, 5 age groups, and 2 gender groups. The 10 skin tone groups are based on the Monk Skin Tone (MST) Scale developed by Google AI [Monk_2019]. Each question template introduces linguistic variation while preserving semantic equivalence. A VLM is prompted with 500 personality trait questions on 5,010 images, resulting in 2.5 million (image, trait, template) queries.
  • Figure 3: Template Sensitivity Analysis: Subplots (a–e) illustrate the variation in the mean of $P(\text{Yes} \mid \text{image}, \text{trait})$ across different templates for a given model.
  • Figure 4: Racial Bias: Each subplot shows the deviation of the mean of $P(\text{Yes} \mid \text{image}, \text{trait}, \text{template 2})$ for each racial group from the overall mean. Results for the full list of traits are provided in Appendix \ref{sec:racial_plots}.
  • Figure 5: Skin Tone Bias: Each subplot shows the deviation of the mean of $P(\text{Yes} \mid \text{image}, \text{trait}, \text{template 5})$ for each Monk Skin Tone group from the overall mean. Results for the full list of traits are provided in Appendix \ref{sec:skin_plots}.
  • ...and 47 more figures
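
The quantities in the captions of Figures 2, 4, and 5 can be illustrated with a minimal Python sketch. It is not the authors' released code. It assumes that the 500 questions per image are the 100 traits crossed with the five templates, and that the "overall mean" is the pooled mean of $P(\text{Yes})$ over all images for a fixed trait and template. The function names `total_queries` and `group_deviations` and the toy probabilities are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Counts taken from the summary and the Figure 2 caption.
N_IMAGES = 5010
N_TRAITS = 100
N_TEMPLATES = 5


def total_queries(n_images=N_IMAGES, n_traits=N_TRAITS, n_templates=N_TEMPLATES):
    """Number of (image, trait, template) queries sent to one model."""
    return n_images * n_traits * n_templates


def group_deviations(records):
    """Deviation of each group's mean P(Yes) from the pooled mean.

    `records` is an iterable of (group, p_yes) pairs for ONE trait and
    ONE template. The pooled mean over all images is an assumption about
    what the captions call the "overall mean".
    """
    records = list(records)
    overall = mean(p for _, p in records)
    by_group = defaultdict(list)
    for group, p_yes in records:
        by_group[group].append(p_yes)
    return {group: mean(ps) - overall for group, ps in by_group.items()}


if __name__ == "__main__":
    print(total_queries())  # 5010 * 100 * 5 = 2,505,000, i.e. about 2.5 million

    # Toy, made-up probabilities for two hypothetical groups.
    toy = [("group_a", 0.75), ("group_a", 0.75), ("group_b", 0.25), ("group_b", 0.25)]
    print(group_deviations(toy))
```

A positive deviation means the model answers "Yes" for that group more often than for the pooled population on the given trait and template; a negative deviation means less often.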