FairerCLIP: Debiasing CLIP's Zero-Shot Predictions using Functions in RKHSs

Sepehr Dehdashtian, Lan Wang, Vishnu Naresh Boddeti

TL;DR

FairerCLIP addresses biases in CLIP's zero-shot predictions by debiasing image and text representations in an RKHS using a Hilbert-Schmidt Independence Criterion (HSIC)–based dependence measure. It frames two bias types—intrinsic dependencies and spurious correlations—and optimizes a multi-term objective that reduces dependence on sensitive attributes $S$ while preserving target information $Y$ and aligning the debiased image/text representations. The solution leverages closed-form updates via generalized eigenvalue problems within an alternating optimization scheme, enabling faster training and better performance in data-constrained settings. Empirically, FairerCLIP improves fairness and maintains accuracy across Waterbirds, CelebA, FairFace, and CFD datasets, significantly reducing equal-opportunity and group-robustness gaps, with notable efficiency gains from random Fourier features. This kernel-based debiasing approach offers flexible training without mandatory ground-truth labels and scales well to medium-sized datasets, underscoring the potential of RKHS methods for bias mitigation in vision-language models.
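
As a rough illustration of the HSIC-based dependence measure mentioned above, the sketch below computes the standard biased empirical HSIC estimator, $\operatorname{tr}(\bm K \bm H \bm L \bm H)/(n-1)^2$, between two sets of samples. It is not the paper's exact $\text{Dep}$ definition: the RBF kernel, the bandwidth `sigma`, the helper names `rbf_gram` and `hsic`, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def rbf_gram(x, sigma=1.0):
    """RBF Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    sq_norms = np.sum(x ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * x @ x.T
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma ** 2))

def hsic(k, l):
    """Biased empirical HSIC between two Gram matrices of the same size."""
    n = k.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n  # centering matrix H = I - 11^T / n
    return float(np.trace(k @ h @ l @ h)) / (n - 1) ** 2

# Synthetic example (assumed data): s plays the role of a sensitive attribute.
rng = np.random.default_rng(0)
s = rng.normal(size=(200, 1))
z_dep = s + 0.1 * rng.normal(size=(200, 1))   # representation that leaks s
z_ind = rng.normal(size=(200, 1))             # representation unrelated to s

hsic_dependent = hsic(rbf_gram(z_dep), rbf_gram(s))
hsic_independent = hsic(rbf_gram(z_ind), rbf_gram(s))
print(hsic_dependent, hsic_independent)
```

A debiasing objective of the kind described here would push the first quantity toward the second, that is, toward a representation whose estimated dependence on $S$ is close to zero.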

Abstract

Large pre-trained vision-language models such as CLIP provide compact and general-purpose representations of text and images that are demonstrably effective across multiple downstream zero-shot prediction tasks. However, owing to the nature of their training process, these models have the potential to 1) propagate or amplify societal biases in the training data and 2) learn to rely on spurious features. This paper proposes FairerCLIP, a general approach for making zero-shot predictions of CLIP more fair and robust to spurious correlations. We formulate the problem of jointly debiasing CLIP's image and text representations in reproducing kernel Hilbert spaces (RKHSs), which affords multiple benefits: 1) Flexibility: Unlike existing approaches, which are specialized to either learn with or without ground-truth labels, FairerCLIP is adaptable to learning in both scenarios. 2) Ease of Optimization: FairerCLIP lends itself to an iterative optimization involving closed-form solvers, which leads to $4\times$-$10\times$ faster training than the existing methods. 3) Sample Efficiency: Under sample-limited conditions, FairerCLIP significantly outperforms baselines when they fail entirely. And, 4) Performance: Empirically, FairerCLIP achieves appreciable accuracy gains on benchmark fairness and spurious correlation datasets over their respective baselines.
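
To make the idea of a closed-form solver concrete, here is a minimal sketch of solving a generalized eigenvalue problem $\bm A \bm v = \lambda \bm B \bm v$ and keeping the top-$r$ eigenvectors. The matrices `a` and `b` below are random placeholders, not the matrices that arise from the paper's objective, and the function name `top_generalized_eigvecs` is hypothetical. The reduction via a Cholesky factor of $\bm B$ is a standard technique and assumes $\bm B$ is symmetric positive definite.

```python
import numpy as np

def top_generalized_eigvecs(a, b, r):
    """Top-r eigenpairs of A v = lambda B v (A symmetric, B symmetric positive definite)."""
    chol = np.linalg.cholesky(b)            # b = chol @ chol.T
    chol_inv = np.linalg.inv(chol)
    c = chol_inv @ a @ chol_inv.T           # equivalent standard symmetric problem
    c = (c + c.T) / 2.0                     # remove round-off asymmetry
    eigvals, u = np.linalg.eigh(c)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:r]   # indices of the r largest eigenvalues
    v = chol_inv.T @ u[:, order]            # map back: columns satisfy v.T @ b @ v = I
    return eigvals[order], v

# Placeholder matrices (assumed, for illustration only).
rng = np.random.default_rng(0)
m = rng.normal(size=(6, 6))
a = m + m.T                                 # symmetric
p = rng.normal(size=(6, 6))
b = p @ p.T + 6.0 * np.eye(6)               # symmetric positive definite

vals, vecs = top_generalized_eigvecs(a, b, r=2)
print(vals)
```

In an alternating scheme, a solve of this kind would be applied to one encoder while the other is held fixed, and then the roles would be swapped.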

Paper Structure

This paper contains 22 sections, 4 theorems, 26 equations, 7 figures, 8 tables, 1 algorithm.

Key Result

Lemma 1

Let $\bm K_{X_I},\bm K_{X_T}\in \mathbb R^{n\times n}$ be the Gram matrices corresponding to $\mathcal{H}_{X_I}$ and $\mathcal{H}_{X_T}$, respectively, i.e., $\left(\bm K_{X_I}\right)_{ij}=k_{X_I}(\bm x_{I_i}, \bm x_{I_j})$ and $\left(\bm K_{X_T}\right)_{ij}=k_{X_T}(\bm x_{T_i}, \bm x_{T_j})$, where … It follows that the corresponding empirical estimator for $\text{Dep}\left(Z_I, Z_T\right)$ is …

Figures (7)

  • Figure 1: Dependence graphs for debiasing.
  • Figure 2: Overview of the training and inference phases of FairerCLIP. (a) Label prediction: when labels are not available for training, FairerCLIP uses the cosine similarity between $X_{T}$ and $X_I$, and between $X_{TS}$ and $X_I$, to predict the target and sensitive attributes, respectively. (b) Training: FairerCLIP uses the representations of the images and of the corresponding text prompts, which are constructed from the target attribute ($Y$), along with the predicted labels to find the image and text encoders, i.e., $\bm f^*_I(\cdot; \bm \Theta_I)$ and $\bm f^*_T(\cdot; \bm \Theta_T)$. (c) Inference: the trained image and text encoders generate debiased representations from the ones produced by CLIP.
  • Figure 3: FairerCLIP acts on representations extracted from a frozen CLIP model. It has two mapping functions, $\bm f_I$ and $\bm f_T$, for the image and text representations. These functions are learned through an alternating optimization algorithm with two closed-form solvers. When ground-truth labels are unavailable for training, FairerCLIP learns from pseudo-labels $\hat{y}$, which are initialized from CLIP's zero-shot predictions and refined iteratively. The bold words in the input text prompts carry the target task label.
  • Figure 4: A geometric illustration of the FairerCLIP training steps. The encoder uses the implicit mapping functions $\phi_I(X)$ and $\phi_T(X)$ of the RBF kernel to map image and text features into an infinite-dimensional space, facilitating linear separability of samples with different target attributes. The optimization process seeks a direction that aligns with the labels $Y$, is statistically independent of $S$, and aligns with the other representation.
  • Figure 5: Results of FairerCLIP and baselines on CFD.
  • ...and 2 more figures
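
The label-prediction step described in the captions of Figures 2 and 3 can be sketched as follows: each image is assigned the index of the text prompt whose feature vector has the highest cosine similarity with the image feature. The feature arrays below are small hand-made placeholders rather than real CLIP embeddings, and the function name `zero_shot_labels` is hypothetical.

```python
import numpy as np

def zero_shot_labels(image_feats, text_feats):
    """Return, for each image, the index of the most cosine-similar text prompt."""
    img = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    sims = img @ txt.T                      # (num_images, num_prompts) cosine similarities
    return np.argmax(sims, axis=1)

# Placeholder features (assumed): two class prompts and three images in a 3-d space.
text_feats = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0]])
image_feats = np.array([[0.9, 0.1, 0.0],
                        [0.2, 2.0, 0.1],
                        [3.0, 1.0, 0.5]])

pseudo_labels = zero_shot_labels(image_feats, text_feats)
print(pseudo_labels)  # [0 1 0]
```

The same routine, applied with prompts describing the sensitive attribute instead of the target attribute, would give predicted sensitive labels; how those pseudo-labels are refined during training is not reproduced here.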

Theorems & Definitions (9)

  • Lemma 1
  • proof
  • Definition 1
  • Theorem 2
  • proof
  • Lemma 1
  • proof
  • Theorem 2
  • proof