GenderBias-\emph{VL}: Benchmarking Gender Bias in Vision Language Models via Counterfactual Probing

Yisong Xiao; Aishan Liu; QianJia Cheng; Zhenfei Yin; Siyuan Liang; Jiapeng Li; Jing Shao; Xianglong Liu; Dacheng Tao

TL;DR

GenderBias-VL tackles occupation-related gender bias in LVLMs by shifting the evaluation focus from group fairness to individual fairness. It introduces a large-scale benchmark built from diffusion-based image generation and counterfactual editing to create visual questions with gender-opposite options across 177 occupations, totaling 34,581 visual question counterfactual pairs. The authors quantify bias using $B_{pair}$, $B_{ovl}$, $Ipss$, and related metrics, and benchmark 15 open-source LVLMs plus two commercial APIs, revealing pervasive biases that correlate with real-world labor statistics and exhibit strong cross-modal alignment. The work provides an up-to-date leaderboard and a nuanced understanding of how visual and language biases co-occur in LVLMs, offering a practical, scalable resource for bias evaluation and mitigation in multimodal AI systems.

Abstract

Large Vision-Language Models (LVLMs) have been widely adopted in various applications; however, they exhibit significant gender biases. Existing benchmarks primarily evaluate gender bias at the demographic group level, neglecting individual fairness, which emphasizes equal treatment of similar individuals. This research gap limits the detection of discriminatory behaviors, as individual fairness offers a more granular examination of biases that group fairness may overlook. For the first time, this paper introduces the GenderBias-\emph{VL} benchmark to evaluate occupation-related gender bias in LVLMs using counterfactual visual questions under individual fairness criteria. To construct this benchmark, we first utilize text-to-image diffusion models to generate occupation images and their gender counterfactuals. Subsequently, we generate corresponding textual occupation options by identifying stereotyped occupation pairs with high semantic similarity but opposite gender proportions in real-world statistics. This method enables the creation of large-scale visual question counterfactuals to expose biases in LVLMs, applicable in both multimodal and unimodal contexts through modifying gender attributes in specific modalities. Overall, our GenderBias-\emph{VL} benchmark comprises 34,581 visual question counterfactual pairs, covering 177 occupations. Using our benchmark, we extensively evaluate 15 commonly used open-source LVLMs (e.g., LLaVA) and state-of-the-art commercial APIs, including GPT-4o and Gemini-Pro. Our findings reveal widespread gender biases in existing LVLMs. Our benchmark offers: (1) a comprehensive dataset for occupation-related gender bias evaluation; (2) an up-to-date leaderboard on LVLM biases; and (3) a nuanced understanding of the biases exhibited by these models. \footnote{The dataset and code are available at the \href{https://genderbiasvl.github.io/}{website}.}
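
The pair-selection step described in the abstract (occupations that are semantically close but dominated by opposite genders) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy `female_share` values, the three-dimensional `embedding` vectors, the 0.9 similarity threshold, the 0.5 parity cut-off, and the function name `stereotyped_pairs`. The authors' actual similarity model, statistics source, and thresholds may differ.

```python
from itertools import combinations
from math import sqrt

# Hypothetical toy inputs (NOT real labor statistics or real embeddings).
female_share = {"nurse": 0.87, "surgeon": 0.26, "secretary": 0.92, "manager": 0.40}
embedding = {
    "nurse": (0.9, 0.1, 0.2),
    "surgeon": (0.8, 0.2, 0.3),
    "secretary": (0.1, 0.9, 0.3),
    "manager": (0.2, 0.8, 0.4),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def stereotyped_pairs(female_share, embedding, min_similarity=0.9, parity=0.5):
    """Return (male-dominated, female-dominated) occupation pairs.

    A pair qualifies when the two occupations fall on opposite sides of
    the assumed parity cut-off and their embeddings are at least
    `min_similarity` apart in cosine similarity terms.
    """
    pairs = []
    for a, b in combinations(sorted(female_share), 2):
        # Opposite sides of parity: the product of the offsets is negative.
        opposite = (female_share[a] - parity) * (female_share[b] - parity) < 0
        if opposite and cosine(embedding[a], embedding[b]) >= min_similarity:
            male, female = (a, b) if female_share[a] < female_share[b] else (b, a)
            pairs.append((male, female))
    return pairs


if __name__ == "__main__":
    for male, female in stereotyped_pairs(female_share, embedding):
        print(f"{male} (male-dominated) <-> {female} (female-dominated)")
```

Each selected pair would then supply the two answer options of a visual question, so that the same question can be asked about an image and its gender counterfactual.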

Paper Structure (12 sections, 3 equations, 5 figures, 2 tables)

This paper contains 12 sections, 3 equations, 5 figures, 2 tables.

Introduction
Related Works
GenderBias-VL Benchmark
Terminology
Counterfactual Visual Question Pairs Construction Pipeline
Evaluation Protocol
Benchmark Evaluation Results
Bias Evaluation on Open-source LVLMs
Case Studies on Commercial LVLMs
Bias Characteristics Analysis
Relationship between Visual and Language Bias
Conclusion and Future Work

Figures (5)

Figure 1: Overview of GenderBias-VL. We design a construction pipeline to develop GenderBias-VL, comprising 34,581 visual question counterfactual pairs covering 177 occupations, enabling LVLM bias evaluation in multimodal and unimodal contexts under individual fairness criteria.
Figure 2: Distribution of LVLMs' bias ($B_{pair}$) under VL-Bias evaluation. Pairs exhibiting maximum biases (positive and negative) are plotted with occupation names. Some occupations are abbreviated.
Figure 3: Results (in %) of commercial LVLMs on Top-10 biased pairs. The polylines denote the average results across evaluation contexts.
Figure 4: Top-10 pairs with the most biased results. Some occupations are abbreviated.
Figure 5: Both panels show the occupation-level bias $B_{micro}$ of InstructBLIP. Scatter plots are divided into four colored quadrants, and points are colored to indicate male- or female-dominated occupations. (a) Occupation bias generally aligns with U.S. Labor Force Statistics. (b) Bias exhibits strong consistency across the visual and language modalities. Similar results for other LVLMs are in the Appendix.
