Bias in Gender Bias Benchmarks: How Spurious Features Distort Evaluation

Yusuke Hirota; Ryo Hachiuma; Boyi Li; Ximing Lu; Michael Ross Boone; Boris Ivanovic; Yejin Choi; Marco Pavone; Yu-Chiang Frank Wang; Noa Garcia; Yuta Nakashima; Chao-Han Huck Yang

TL;DR

This work shows that non-gender features such as color, objects, and backgrounds can spuriously correlate with gender in standard benchmarks, leading to biased evaluations of vision-language models. By perturbing these features and measuring the resulting changes in bias scores, the authors demonstrate that even small changes can dramatically alter metrics like YGap and MaxSkew, compromising the reliability of current bias assessments. They provide both a theoretical causal framework and extensive experiments across generative VLMs and CLIP variants, revealing that bias measurements can reflect responses to spurious features more than true gender bias. The paper advocates reporting bias metrics together with feature-sensitivity measurements and proposes a practical two-dimensional evaluation approach, improving the robustness and interpretability of gender-bias evaluations in real-world benchmarks.

Abstract

Gender bias in vision-language foundation models (VLMs) raises concerns about their safe deployment and is typically evaluated using benchmarks with gender annotations on real-world images. However, as these benchmarks often contain spurious correlations between gender and non-gender features, such as objects and backgrounds, we identify a critical oversight in gender bias evaluation: Do spurious features distort gender bias evaluation? To address this question, we systematically perturb non-gender features across four widely used benchmarks (COCO-gender, FACET, MIAP, and PHASE) and various VLMs to quantify their impact on bias evaluation. Our findings reveal that even minimal perturbations, such as masking just 10% of objects or weakly blurring backgrounds, can dramatically alter bias scores, shifting metrics by up to 175% in generative VLMs and 43% in CLIP variants. This suggests that current bias evaluations often reflect model responses to spurious features rather than gender bias, undermining their reliability. Since creating spurious feature-free benchmarks is fundamentally challenging, we recommend reporting bias metrics alongside feature-sensitivity measurements to enable a more reliable bias assessment.
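
The measurement described above, a bias score computed before and after perturbing non-gender features, can be sketched in a few lines. This is a minimal illustration and not the authors' implementation. It assumes MaxSkew@k follows the common ranking-fairness definition (the maximum over groups of the log ratio between the observed and desired share in the top-k results) and that a "shift" is the percentage change relative to the unperturbed score. The function names, group labels, and toy rankings are hypothetical.

```python
import math
from collections import Counter


def max_skew(top_k_genders, desired):
    """MaxSkew@k: largest log-ratio of observed to desired share in the top-k.

    top_k_genders: gender labels of the k retrieved images (hypothetical input).
    desired: mapping from group label to its target share, e.g. 0.5 each.
    """
    k = len(top_k_genders)
    counts = Counter(top_k_genders)
    skews = []
    for group, target in desired.items():
        observed = counts.get(group, 0) / k
        # A group absent from the top-k has skew -inf, so it never sets the max.
        skews.append(math.log(observed / target) if observed > 0 else -math.inf)
    return max(skews)


def relative_shift(original, perturbed):
    """Percentage change of a bias score after perturbing non-gender features."""
    if original == 0:
        raise ValueError("relative shift is undefined for a zero baseline score")
    return 100.0 * (perturbed - original) / abs(original)


# Toy rankings (made up for illustration, not results from the paper).
desired = {"man": 0.5, "woman": 0.5}
before = max_skew(["man"] * 7 + ["woman"] * 3, desired)  # original images
after = max_skew(["man"] * 6 + ["woman"] * 4, desired)   # e.g. background blurred

# Two-dimensional report: the bias score together with its sensitivity.
print(f"MaxSkew@10 = {before:.3f}, shift under perturbation = {relative_shift(before, after):+.1f}%")
```

Reporting the pair (score, shift) rather than the score alone mirrors the recommendation above: a large shift under a mild perturbation suggests that the score is driven by spurious features rather than by gender.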
