They're All Doctors: Synthesizing Diverse Counterfactuals to Mitigate Associative Bias

Salma Abdel Magid, Jui-Hsien Wang, Kushal Kafle, Hanspeter Pfister

TL;DR

This work tackles associative bias in CLIP-based vision-language models by generating large-scale, diverse synthetic counterfactual images and captions to debias profession-related queries. It introduces a fully automatic pipeline that masks human regions, inpaints counterfactual appearances, and trains CLIP with a combined loss that enforces cohesion among counterfactual variants of the same base image: $\mathcal{L} = \beta_1 \mathcal{L}_{\text{CLIP}} + \beta_0 \mathcal{L}_{cf}$. A weight-space ensembling approach lets users trade accuracy for fairness while preserving compatibility with the original CLIP. Empirical results on FairFace and PATA show substantial improvements in fairness metrics (MaxSkew, NDKL, and worst-group recall) while maintaining competitive downstream performance on Flickr30k (R@5) and ImageNet1K. The work highlights the practical potential of synthetic, privacy-preserving data for debiasing VLMs, with clear avenues for extending the scope beyond professions and for addressing biases in the synthesis models themselves.
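As a rough illustration of the combined objective, the sketch below computes $\beta_1 \mathcal{L}_{\text{CLIP}} + \beta_0 \mathcal{L}_{cf}$ on a batch of L2-normalized embeddings with NumPy. The summary does not spell out the exact form of $\mathcal{L}_{cf}$, so the supervised-contrastive form used here (counterfactuals sharing a base image are treated as positives) is an assumption, as are the function names, the `base_ids` array, and the temperature of 0.07.

```python
import numpy as np

def _log_softmax(x, axis=-1):
    # Numerically stable log-softmax; rows may contain -inf entries.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_loss(img, txt, temperature=0.07):
    # Symmetric image-to-text and text-to-image contrastive loss;
    # row i of `img` is paired with row i of `txt`.
    logits = img @ txt.T / temperature
    idx = np.arange(len(img))
    i2t = -_log_softmax(logits, axis=1)[idx, idx].mean()
    t2i = -_log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (i2t + t2i)

def counterfactual_loss(img, base_ids, temperature=0.07):
    # ASSUMED form: images inpainted from the same base image are positives,
    # all other images in the batch are negatives.
    n = len(img)
    self_mask = np.eye(n, dtype=bool)
    logits = np.where(self_mask, -np.inf, img @ img.T / temperature)
    log_prob = _log_softmax(logits, axis=1)
    pos = (base_ids[:, None] == base_ids[None, :]) & ~self_mask
    pos_count = pos.sum(axis=1)
    has_pos = pos_count > 0
    if not has_pos.any():
        return 0.0  # no counterfactual pairs in this batch
    summed = np.where(pos, log_prob, 0.0).sum(axis=1)
    return float((-summed[has_pos] / pos_count[has_pos]).mean())

def combined_loss(img, txt, base_ids, beta_1=1.0, beta_0=1.0):
    # L = beta_1 * L_CLIP + beta_0 * L_cf, mirroring the formula in the text.
    return beta_1 * clip_loss(img, txt) + beta_0 * counterfactual_loss(img, base_ids)
```

Under this assumed form, the counterfactual term is small when variants of the same base image land close together in embedding space regardless of the inpainted person, which is the cohesion the text describes.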

Abstract

Vision Language Models (VLMs) such as CLIP are powerful models; however, they can exhibit unwanted biases, making them less safe when deployed directly in applications such as text-to-image and text-to-video retrieval, reverse search, or classification tasks. In this work, we propose a novel framework to generate synthetic counterfactual images to create a diverse and balanced dataset that can be used to fine-tune CLIP. Given a set of diverse synthetic base images from text-to-image models, we leverage off-the-shelf segmentation and inpainting models to place humans with diverse visual appearances in context. We show that CLIP trained on such datasets learns to disentangle the human appearance from the context of an image, i.e., what makes a doctor is not correlated to the person's visual appearance, like skin color or body type, but to the context, such as background, the attire they are wearing, or the objects they are holding. We demonstrate that our fine-tuned CLIP model, $CF_\alpha$, improves key fairness metrics such as MaxSkew, MinSkew, and NDKL by 40-66% for image retrieval tasks, while still achieving similar levels of performance in downstream tasks. We show that, by design, our model retains maximal compatibility with the original CLIP models, and can be easily controlled to support different accuracy versus fairness trade-offs in a plug-n-play fashion.
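For readers unfamiliar with the retrieval fairness metrics named above, the following sketch computes MaxSkew@k, MinSkew@k, and NDKL for a ranked list of group labels, using the commonly cited definitions from the fair-ranking literature. The function names, the `desired` distribution argument, and the small `eps` that avoids taking the log of zero are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def skew_at_k(ranked_groups, desired, k, eps=1e-12):
    # Skew for group g: ln(observed share in the top k / desired share).
    top = np.asarray(ranked_groups[:k])
    return {g: float(np.log((np.mean(top == g) + eps) / p))
            for g, p in desired.items()}

def max_skew(ranked_groups, desired, k):
    # Most over-represented group in the top k (0 means perfectly balanced).
    return max(skew_at_k(ranked_groups, desired, k).values())

def min_skew(ranked_groups, desired, k):
    # Most under-represented group in the top k.
    return min(skew_at_k(ranked_groups, desired, k).values())

def ndkl(ranked_groups, desired):
    # Rank-discounted KL divergence between each prefix distribution
    # and the desired distribution, normalized by the sum of discounts.
    groups = list(desired)
    q = np.array([desired[g] for g in groups])
    counts = np.zeros(len(groups))
    total, z = 0.0, 0.0
    for i, label in enumerate(ranked_groups, start=1):
        counts[groups.index(label)] += 1
        p = counts / i
        nz = p > 0  # terms with p = 0 contribute nothing to the KL sum
        kl = np.sum(p[nz] * np.log(p[nz] / q[nz]))
        weight = 1.0 / np.log2(i + 1)
        total += weight * kl
        z += weight
    return float(total / z)

# Hypothetical retrieval result: group label of each retrieved image, in rank order.
desired = {"male": 0.5, "female": 0.5}
balanced = ["male", "female", "male", "female"]
one_sided = ["male", "male", "male", "male"]
```

With a uniform desired distribution, a perfectly alternating ranking gives a MaxSkew close to zero, while a ranking drawn from a single group gives MaxSkew and NDKL equal to $\ln 2$.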

Paper Structure (27 sections, 3 equations, 5 figures, 7 tables, 1 algorithm)

Figures (5)

  • Figure 1: After training on our counterfactual image dataset, a fine-tuned CLIP ViT-B/16 model retrieves a more uniform distribution of images across races and genders for the query "flight attendant" on the FairFace dataset.
  • Figure 2: Synthetic counterfactual generation overview. For each base textual keyword, such as the profession "software developer", we first use LLMs to generate a set of plausible captions. Each caption is then sampled with additional decorator inputs to generate a set of base images. We then compute masks that correspond to human body parts, and finally inpaint with decorators to synthesize counterfactuals. This pipeline enables us to generate a large number of counterfactuals with diverse humans, while controlling visual cues that might cause secondary associations in downstream training tasks, such as the background or a prop a subject is holding.
  • Figure 3: The counterfactual loss, $\mathcal{L}_{cf}$, is added to contrast counterfactual images of the same base image; the typical text-to-image and image-to-text contrastive loss, $\mathcal{L}_{\text{CLIP}}$, is also used during our training. Similar to tian2024stablerep and mu2022slip, we use random cropping to augment data, as in SimCLR chen2020simple.
  • Figure 4: The gender similarity bias measured on PATA. We visualize the similarity biases for the top 20 professions; the markers distinguish professions biased towards men from those biased towards women. Our framework mitigates gender bias for a variety of occupations.
  • Figure 5: The fairness-accuracy trade-off as we vary $\alpha$ in weight-space ensembling. Fairness is measured as MaxSkew@1k on FairFace, and accuracy is measured on the Flickr30k (left) and ImageNet1K (right) datasets.
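
The trade-off in Figure 5 comes from weight-space ensembling, which can be sketched as a linear interpolation between the original and fine-tuned parameters. The snippet below is a minimal sketch under stated assumptions: checkpoints are represented as dictionaries of NumPy arrays, the function name is hypothetical, and the convention that $\alpha = 0$ recovers the original CLIP while $\alpha = 1$ recovers the fully fine-tuned model is assumed rather than confirmed by the text above.

```python
import numpy as np

def interpolate_weights(original, finetuned, alpha):
    # Assumed convention: theta_alpha = (1 - alpha) * theta_original + alpha * theta_finetuned.
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if original.keys() != finetuned.keys():
        raise ValueError("checkpoints must contain the same parameter names")
    return {name: (1.0 - alpha) * original[name] + alpha * finetuned[name]
            for name in original}

# Toy checkpoints standing in for the two CLIP models (hypothetical values).
original = {"proj.weight": np.zeros((2, 2)), "proj.bias": np.zeros(2)}
finetuned = {"proj.weight": np.ones((2, 2)), "proj.bias": np.full(2, 2.0)}
halfway = interpolate_weights(original, finetuned, alpha=0.5)
```

Because both models share one architecture, the interpolated parameters can be loaded in place of the original checkpoint, and sweeping `alpha` traces out a curve like the one the figure reports.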