Bias in Gender Bias Benchmarks: How Spurious Features Distort Evaluation

Yusuke Hirota, Ryo Hachiuma, Boyi Li, Ximing Lu, Michael Ross Boone, Boris Ivanovic, Yejin Choi, Marco Pavone, Yu-Chiang Frank Wang, Noa Garcia, Yuta Nakashima, Chao-Han Huck Yang

TL;DR

This work shows that non-gender features such as color, objects, and backgrounds can spuriously correlate with gender in standard benchmarks, leading to biased evaluations of vision-language models. By perturbing these features and measuring the resulting changes in bias scores, the authors demonstrate that even small changes can dramatically alter metrics like YGap and MaxSkew, compromising the reliability of current bias assessments. They provide both a theoretical causal framework and extensive experiments across generative VLMs and CLIP variants, revealing that bias measurements can reflect responses to spurious features more than true gender bias. The paper advocates reporting bias metrics together with feature-sensitivity measurements and proposes a practical two-dimensional evaluation approach, improving the robustness and interpretability of gender-bias evaluations in real-world benchmarks.

Abstract

Gender bias in vision-language foundation models (VLMs) raises concerns about their safe deployment and is typically evaluated using benchmarks with gender annotations on real-world images. However, as these benchmarks often contain spurious correlations between gender and non-gender features, such as objects and backgrounds, we identify a critical oversight in gender bias evaluation: Do spurious features distort gender bias evaluation? To address this question, we systematically perturb non-gender features across four widely used benchmarks (COCO-gender, FACET, MIAP, and PHASE) and various VLMs to quantify their impact on bias evaluation. Our findings reveal that even minimal perturbations, such as masking just 10% of objects or weakly blurring backgrounds, can dramatically alter bias scores, shifting metrics by up to 175% in generative VLMs and 43% in CLIP variants. This suggests that current bias evaluations often reflect model responses to spurious features rather than gender bias, undermining their reliability. Since creating spurious feature-free benchmarks is fundamentally challenging, we recommend reporting bias metrics alongside feature-sensitivity measurements to enable a more reliable bias assessment.
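The percentage shifts quoted in the abstract compare a bias score on the original images with the same score on perturbed images. A minimal sketch of one such comparison follows, assuming the shift is the signed change relative to the original score; the paper's exact definition is not given in this section, and the function name `relative_shift` and the example scores are hypothetical.

```python
def relative_shift(original_score: float, perturbed_score: float) -> float:
    """Signed percentage change of a bias score after a perturbation.

    Assumption: the shift is measured relative to the magnitude of the
    score on the original (unperturbed) benchmark.
    """
    if original_score == 0:
        raise ValueError("relative shift is undefined for an original score of zero")
    return 100.0 * (perturbed_score - original_score) / abs(original_score)


# Illustrative numbers only, not results from the paper.
shift = relative_shift(original_score=2.0, perturbed_score=3.0)
print(f"bias score shifted by {shift:.1f}%")
```

Reporting this shift next to the bias score itself is one simple way to realise the recommendation of pairing bias metrics with a feature-sensitivity measurement.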

Paper Structure

This paper contains 27 sections, 11 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 1: We investigate how non-gender features affect gender bias evaluations in VLMs through feature-perturbation analysis, revealing that measured biases are highly sensitive to spurious features, compromising the validity of direct gender bias assessments. Left: Simplified causal graph of gender bias evaluation, illustrating how spurious non-gender features ($B$) can influence the relationship between gender ($G$) and model outputs ($O$). Right: Our perturbation analysis on these spurious features (e.g., shifting hues or randomly masking objects) shows that even small modifications affect model predictions ($B \rightarrow O$), thus obscuring the true gender bias ($G \rightarrow O$) we aim to measure.
  • Figure 2: Examples of feature-extracted inputs (i.e., $I_b$). Note that $I_{\text{object}}$ is a multi-hot representation of the detected objects.
  • Figure 3: Examples of the feature-perturbed images and the predictions of LLaVA-1.5-7B for the original and modified images.
  • Figure 4: Spurious correlation strength ($\text{Acc}_b$ in \ref{tab:conf-detection-conv}) vs. relative difference $\Delta$ for generative VLMs (left) and CLIP variants (right). The dashed line shows the correlation, demonstrating that stronger spurious correlations tend to cause larger shifts in bias measurements.
  • Figure 5: Top-$10$ retrieved images by SigLIP-ViT-S/14 for the prompt "A photo of an unkind person" on original and background blurred images (weak perturbation). Green-bordered pairs indicate images retrieved in both sets. The minimal overlap (only one shared image) highlights the model's sensitivity to background changes.
  • ...and 6 more figures
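
The Figure 5 caption compares the top-10 images retrieved before and after background blurring and counts the images common to both sets. A minimal sketch of that overlap count follows, assuming each retrieval result is a ranked list of image identifiers; the function name `topk_overlap` and the identifiers are hypothetical and do not come from the paper.

```python
from typing import Sequence, Tuple


def topk_overlap(
    original_ids: Sequence[str], perturbed_ids: Sequence[str], k: int = 10
) -> Tuple[int, float]:
    """Count images shared by two ranked top-k retrieval lists.

    Returns the number of shared images and that number divided by k.
    Assumption: both lists are ordered best-first and contain unique ids.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    shared = set(original_ids[:k]) & set(perturbed_ids[:k])
    return len(shared), len(shared) / k


# Hypothetical ids: only "img_07" appears in both top-10 lists.
original = [f"img_{i:02d}" for i in range(10)]
blurred = ["img_07"] + [f"img_{i:02d}" for i in range(20, 29)]
count, fraction = topk_overlap(original, blurred, k=10)
print(count, fraction)
```

A low overlap under a weak perturbation indicates that the ranking depends heavily on the perturbed feature, which is the sensitivity the caption describes.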