Unveiling the "Fairness Seesaw": Discovering and Mitigating Gender and Race Bias in Vision-Language Models

Jian Lan, Udo Schlegel, Tanveer Hannan, Gengyuan Zhang, Haokun Chen, Thomas Seidl

TL;DR

This work investigates gender and race bias in Vision-Language Models by probing not only generated responses but also internal representations and confidence distributions. It uncovers the Fairness Seesaw, a phenomenon where fairness cues peak in intermediate layers while final layers and residual streams can reinforce bias, and introduces RES-FAIR as a post-hoc residual-flow adjustment to disentangle biased directions from fair ones. The approach rests on a subspace-decomposition framework that separates fair and biased directions in layer residuals and projects away bias while reinforcing fair components, with an uncertainty-aware training variant for comparison. Empirical results on PAIRS and SocialCounterfactuals show improved fairness and confidence calibration without sacrificing general reasoning, offering a principled path toward debiasing multimodal models in practice.

Abstract

Although Vision-Language Models (VLMs) have achieved remarkable success, the knowledge mechanisms underlying their social biases remain a black box, and the resulting fairness- and ethics-related problems harm certain groups of people in society. It is unknown to what extent VLMs exhibit gender and race bias in generative responses. In this paper, we conduct a systematic discovery of gender and race bias in state-of-the-art VLMs, focusing not only on surface-level responses but also on the internal probability distributions and hidden-state dynamics. Our empirical analysis reveals three critical findings: 1) The Fairness Paradox: Models often generate fair text labels while maintaining highly skewed confidence scores (miscalibration) toward specific social groups. 2) Layer-wise Fluctuation: Fairness knowledge is not uniformly distributed; it peaks in intermediate layers and undergoes substantial erosion in the final layers. 3) Residual Discrepancy: Within a single hidden layer, different residual streams carry conflicting social knowledge, with some reinforcing fairness while others amplify bias. Leveraging these insights, we propose RES-FAIR (RESidual Flow Adjustment for Inference Recalibration), a post-hoc framework that mitigates bias by localizing biased residual directions and projecting hidden states away from them while amplifying fair components. Evaluations on the PAIRS and SocialCounterfactuals datasets demonstrate that our discovery-based approach significantly improves response fairness and confidence calibration without compromising general reasoning abilities. Our work provides a new lens for understanding how multi-modal models store and process sensitive social information.
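
To make the core idea of "projecting hidden states away from biased residual directions while amplifying fair components" concrete, the following is a minimal sketch in plain Python. It is not the RES-FAIR implementation: the function name `adjust_hidden_state`, the scaling parameter `alpha`, the use of a single direction per subspace, and the step that orthogonalizes the fair direction against the biased one are all assumptions made for illustration. How the directions are actually localized is not specified in the abstract and is not shown here.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Return v scaled to unit length; reject the zero vector."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [a / norm for a in v]


def adjust_hidden_state(h, biased_dir, fair_dir, alpha=0.5):
    """Hypothetical helper: remove the component of h along biased_dir,
    then amplify the component along fair_dir by a factor (1 + alpha)."""
    b = normalize(biased_dir)

    # Assumption: orthogonalize the fair direction against the biased one,
    # so that amplifying it cannot reintroduce the removed component.
    overlap = dot(fair_dir, b)
    f = normalize([fi - overlap * bi for fi, bi in zip(fair_dir, b)])

    # Project away the biased component: h - (h . b) b
    hb = dot(h, b)
    h_proj = [hi - hb * bi for hi, bi in zip(h, b)]

    # Amplify the fair component: h_proj + alpha (h_proj . f) f
    hf = dot(h_proj, f)
    return [hi + alpha * hf * fi for hi, fi in zip(h_proj, f)]


if __name__ == "__main__":
    h = [1.0, 2.0, 3.0]           # toy hidden state
    biased = [1.0, 0.0, 0.0]      # toy biased direction
    fair = [0.0, 1.0, 0.0]        # toy fair direction
    print(adjust_hidden_state(h, biased, fair, alpha=0.5))
```

In this toy setting the adjusted state has no remaining component along the biased direction, its component along the fair direction grows from 2.0 to 3.0, and the remaining coordinate is unchanged. In a real model, an adjustment of this kind would be applied to layer residuals at inference time rather than to a three-dimensional vector.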

Paper Structure

This paper contains 32 sections, 19 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: A comparison showing previous evaluation problems on the left and improved MCS inputs with model generations on the right.
  • Figure 2: Our prompts and data examples. The left-hand side shows the system prompts; the process by which we design them is described in Appendix \ref{appa2}. The right-hand side shows one example input at the top and some examples of our candidate templates at the bottom.
  • Figure 3: Layer residuals and the schematic diagram of our method in the latent space.
  • Figure 4: The proportions of responses for LLaVA-NeXT-13B and Qwen2.5-VL-32B when evaluated on PAIRS for gender. Each bar and each proportion is labeled in the figure. Blue bars show the original performance; orange bars show the performance with our post-hoc method.
  • Figure 5: Proportions of fairness-associated responses versus race-biased responses. Results are reported for the strongest large models on the race category of SCF.
  • ...and 6 more figures