Interpretable Debiasing of Vision-Language Models for Social Fairness

Na Min An; Yoonna Jang; Yusuke Hirota; Ryo Hachiuma; Isabelle Augenstein; Hyunjung Shim

TL;DR

This work introduces DeBiasLens, an interpretable, model-agnostic bias mitigation framework that localizes social attribute neurons in VLMs using sparse autoencoders (SAEs) applied to multimodal encoders. Selectively deactivating these neurons mitigates socially biased behaviors of VLMs without degrading their semantic knowledge.

Abstract

The rapid advancement of vision-language models (VLMs) has raised growing concerns that their black-box reasoning processes could lead to unintended forms of social bias. Current debiasing approaches focus on mitigating surface-level bias signals through post-hoc learning or test-time algorithms, while leaving the internal dynamics of the model largely unexplored. In this work, we introduce an interpretable, model-agnostic bias mitigation framework, DeBiasLens, that localizes social attribute neurons in VLMs through sparse autoencoders (SAEs) applied to multimodal encoders. Building upon the disentanglement ability of SAEs, we train them on facial image or caption datasets without corresponding social attribute labels to uncover neurons highly responsive to specific demographics, including those that are underrepresented. By selectively deactivating the social neurons most strongly tied to bias for each group, we effectively mitigate socially biased behaviors of VLMs without degrading their semantic knowledge. Our research lays the groundwork for future auditing tools that prioritize social fairness in emerging real-world AI systems.
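
The deactivation step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single-layer ReLU SAE, uses random weights in place of trained ones, and every name (`sae_encode`, `sae_decode`, `debias`, `social_neurons`, `alpha`) is hypothetical. The blend with the original feature follows the weighted sum mentioned in the Figure 2 caption; which term carries the weight is also an assumption.

```python
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sae_encode(x, W_enc, b_enc):
    """Map an encoder feature to sparse latents (ReLU SAE is an assumption)."""
    return [max(0.0, a + b) for a, b in zip(matvec(W_enc, x), b_enc)]


def sae_decode(z, W_dec, b_dec):
    """Reconstruct the encoder feature from the SAE latents."""
    return [a + b for a, b in zip(matvec(W_dec, z), b_dec)]


def debias(x, W_enc, b_enc, W_dec, b_dec, social_neurons, alpha=0.5):
    """Zero the chosen latents, decode, and blend with the original feature.

    alpha is a hypothetical mixing weight: 0 keeps the original feature,
    1 uses only the reconstruction with the social neurons switched off.
    """
    z = sae_encode(x, W_enc, b_enc)
    for j in social_neurons:
        z[j] = 0.0  # deactivate the selected social neuron
    x_debiased = sae_decode(z, W_dec, b_dec)
    return [alpha * d + (1.0 - alpha) * o for d, o in zip(x_debiased, x)]


# Toy setup: a 4-dim feature and 8 SAE latents with random (untrained) weights.
random.seed(0)
d, m = 4, 8
W_enc = [[random.gauss(0, 1) for _ in range(d)] for _ in range(m)]
b_enc = [0.0] * m
W_dec = [[random.gauss(0, 1) for _ in range(m)] for _ in range(d)]
b_dec = [0.0] * d
x = [random.gauss(0, 1) for _ in range(d)]

# Indices 1 and 5 stand in for neurons found by some probing procedure.
out = debias(x, W_enc, b_enc, W_dec, b_dec, social_neurons=[1, 5], alpha=0.5)
```

In this sketch the neuron indices are supplied by hand; in the framework they would come from the probing stage, whose selection criteria are not reproduced here.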

Paper Structure (31 sections, 9 equations, 30 figures, 13 tables)

Introduction
Related Work
Bias Mitigation in Vision-Language Models
Mechanistic Interpretability
Methodology
SAE Training
Social Neuron Probing
Social Neuron-Modulated Inference
Experiments
Experimental Details
SAE Training datasets
Evaluation datasets
Evaluation metrics
Comparison methods
Implementation details
...and 16 more sections

Figures (30)

Figure 1: Social bias mitigation in VLMs. While existing models retrieve images with skewed demographic distributions or answer definitively on ambiguous image-text pairs, our DeBiasLens alleviates social biases across both image and text modalities.
Figure 2: Overview of our interpretable VLM debiasing framework. DeBiasLens consists of three stages: (1) An SAE is trained on top of the last layer of the VLM image/text encoder (Section "SAE Training"). (2) The social neurons are identified based on the consistency and specificity of SAE activations across data (Section "Social Neuron Probing"). (3) The selected neurons are activated to generate debiased features, which are combined with the original features in a weighted sum for use in downstream tasks (Section "Social Neuron-Modulated Inference").
Figure 3: Comparison between similarity trend of facial image pairs for original and SAE-attached CLIP. The difference between the cosine similarity of random and social attribute-overlapping image pairs (G: gender, R: race, A: age) becomes more pronounced when our SAE is attached, indicating that the SAE can capture latent bias-sensitive features.
Figure 4: Comparison between bias mitigation vs. general performance of LVLMs. Our method achieves the best trade-off among existing approaches ($\leftarrow$, $\uparrow$, the better).
Figure 5: Neuron-specific results for CLIP text encoder. Modulating gender neurons mitigates only gender bias, indicating high neuron specificity.
...and 25 more figures
