Joint Vision-Language Social Bias Removal for CLIP

Haoyu Zhang, Yangyang Guo, Mohan Kankanhalli

TL;DR

We propose a novel V-L debiasing framework that first aligns image and text biases and then removes them from both modalities, achieving multi-modal bias mitigation while maintaining V-L alignment in the debiased embeddings.

Abstract

Vision-Language (V-L) pre-trained models such as CLIP show prominent capabilities in various downstream tasks. Despite this promise, V-L models are notoriously limited by their inherent social biases. A typical demonstration is that V-L models often produce biased predictions against specific groups of people, significantly undermining their real-world applicability. Existing approaches endeavor to mitigate the social bias problem in V-L models by removing biased attribute information from model embeddings. However, on revisiting these methods, we find that their bias removal is frequently accompanied by greatly compromised V-L alignment capabilities. We then reveal that this performance degradation stems from unbalanced debiasing of the image and text embeddings. To address this issue, we propose a novel V-L debiasing framework that first aligns image and text biases and then removes them from both modalities. By doing so, our method achieves multi-modal bias mitigation while maintaining the V-L alignment in the debiased embeddings. Additionally, we advocate a new evaluation protocol that can 1) holistically quantify the model debiasing and V-L alignment ability, and 2) evaluate the generalization of social bias removal models. We believe this work will offer new insights and guidance for future studies addressing the social bias problem in CLIP.
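
The claim that unbalanced debiasing degrades V-L alignment can be illustrated with a toy example. The sketch below is not the framework proposed in the paper: it swaps in a plain orthogonal projection as a stand-in for "removing biased attribute information", and the 3-dimensional vectors and the names `bias_dir`, `image`, and `text` are hypothetical values chosen only for illustration. Under these assumptions, removing the attribute direction from the text embedding alone lowers the image-text cosine similarity, while removing it from both embeddings restores it.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))

def remove_direction(x, d):
    """Project x onto the subspace orthogonal to direction d."""
    scale = dot(x, d) / dot(d, d)
    return [xi - scale * di for xi, di in zip(x, d)]

# Hypothetical embeddings (assumption): axis 0 carries the concept,
# axis 2 carries the social attribute shared by both modalities.
bias_dir = [0.0, 0.0, 1.0]
image = [1.0, 0.0, 1.0]
text = [1.0, 0.0, 1.0]

text_debiased = remove_direction(text, bias_dir)
image_debiased = remove_direction(image, bias_dir)

sim_before = cosine(image, text)                       # both biased
sim_unbalanced = cosine(image, text_debiased)          # text-only debiasing
sim_balanced = cosine(image_debiased, text_debiased)   # both debiased

print(sim_before, sim_unbalanced, sim_balanced)
```

In this toy setting the similarity drops from 1.0 to about 0.71 when only the text side is debiased and returns to 1.0 when both sides are. Real CLIP embeddings do not share a single known bias direction across modalities, which is presumably why the paper aligns the image and text biases before removing them.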

Paper Structure

This paper contains 12 sections, 10 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Model results before and after removing social biases related to gender, race, and age, respectively.
  • Figure 2: Visualization of different social biases in the image (the top row) and text (the bottom row) embeddings through t-SNE. A fair model should embed different attributes (different symbols pertaining to one concept category) with respect to one concept (same color) close to each other.
  • Figure 3: Effect size results for text and image biases. Statistically significant (sig.) results are marked with dark blue and dark green colors. The markers $*$/$**$/$***$ denote p-values smaller than 0.1/0.05/0.01, respectively.
  • Figure 4: Overall pipeline. After obtaining the embedding of the given image, text, and counterfactual text using a frozen CLIP model, we first align the bias from both modalities with the help of two instantiated distributions. In addition, we design a counterfactual debiasing approach to bridge the embedding gap between two embeddings sharing the same concept yet with contrasting attributes.
  • Figure 5: t-SNE plot of the bias information $\psi(v_i)$ in sampled $v_i$, estimated by our bias alignment module before and after training.