Unveiling the "Fairness Seesaw": Discovering and Mitigating Gender and Race Bias in Vision-Language Models
Jian Lan, Udo Schlegel, Tanveer Hannan, Gengyuan Zhang, Haokun Chen, Thomas Seidl
TL;DR
This work investigates gender and race bias in Vision-Language Models by probing not only generated responses but also internal representations and confidence distributions. It uncovers the Fairness Seesaw, a phenomenon where fairness cues peak in intermediate layers while final layers and residual streams can reinforce bias, and introduces RES-FAIR as a post-hoc residual-flow adjustment to disentangle biased directions from fair ones. The approach rests on a subspace-decomposition framework that separates fair and biased directions in layer residuals and projects away bias while reinforcing fair components, with an uncertainty-aware training variant for comparison. Empirical results on PAIRS and SocialCounterfactuals show improved fairness and confidence calibration without sacrificing general reasoning, offering a principled path toward debiasing multimodal models in practice.
Abstract
Although Vision-Language Models (VLMs) have achieved remarkable success, the knowledge mechanisms underlying their social biases remain a black box, and the resulting fairness and ethics problems harm certain groups of people in society. It remains unclear to what extent VLMs exhibit gender and race bias in generative responses. In this paper, we conduct a systematic investigation of gender and race bias in state-of-the-art VLMs, focusing not only on surface-level responses but also on internal probability distributions and hidden-state dynamics. Our empirical analysis reveals three critical findings: 1) The Fairness Paradox: Models often generate fair text labels while maintaining highly skewed confidence scores (miscalibration) toward specific social groups. 2) Layer-wise Fluctuation: Fairness knowledge is not uniformly distributed; it peaks in intermediate layers and erodes substantially in the final layers. 3) Residual Discrepancy: Within a single hidden layer, different residual streams carry conflicting social knowledge, with some reinforcing fairness and others amplifying bias. Leveraging these insights, we propose RES-FAIR (RESidual Flow Adjustment for Inference Recalibration), a post-hoc framework that mitigates bias by localizing biased residual directions and projecting hidden states away from them while amplifying fair components. Evaluations on the PAIRS and SocialCounterfactuals datasets demonstrate that our discovery-based approach significantly improves response fairness and confidence calibration without compromising general reasoning abilities. Our work provides a new lens for understanding how multimodal models store and process sensitive social information.
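The projection idea behind RES-FAIR can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function name `adjust_hidden_state`, the strengths `alpha` and `beta`, and the assumption that biased and fair directions are already available as row vectors are all hypothetical choices made for illustration. The sketch removes the component of a hidden state that lies in the biased subspace and scales up the component in the fair subspace, after first orthogonalizing the fair directions against the biased ones so that amplification cannot reintroduce bias.

```python
import numpy as np


def adjust_hidden_state(h, bias_dirs, fair_dirs, alpha=1.0, beta=0.5):
    """Illustrative residual adjustment (hypothetical, not the paper's code).

    h         : (d,) hidden state at one layer and token position
    bias_dirs : (k_b, d) linearly independent directions assumed to encode bias
    fair_dirs : (k_f, d) directions assumed to encode fairness
    alpha     : fraction of the biased component to remove (1.0 removes all)
    beta      : extra weight added to the fair component
    """
    # Orthonormal basis of the biased subspace, shape (d, k_b).
    q_bias, _ = np.linalg.qr(bias_dirs.T)

    # Strip from each fair direction whatever lies in the biased subspace,
    # so amplifying fairness never adds bias back. This assumes no fair
    # direction lies entirely inside the biased subspace.
    fair_perp = fair_dirs - (fair_dirs @ q_bias) @ q_bias.T
    q_fair, _ = np.linalg.qr(fair_perp.T)

    # Components of h in each subspace.
    bias_part = q_bias @ (q_bias.T @ h)
    fair_part = q_fair @ (q_fair.T @ h)

    return h - alpha * bias_part + beta * fair_part


if __name__ == "__main__":
    h = np.array([1.0, 2.0, 3.0])
    bias = np.array([[1.0, 0.0, 0.0]])   # toy biased direction
    fair = np.array([[1.0, 1.0, 0.0]])   # toy fair direction, overlaps with bias
    print(adjust_hidden_state(h, bias, fair))
```

With `alpha=1.0` the adjusted state is exactly orthogonal to every biased direction, while components outside both subspaces are left unchanged. How the directions are localized, and at which layers the adjustment is applied, are questions the sketch does not address.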
