FairRAG: Fair Human Generation via Fair Retrieval Augmentation

Robik Shrestha, Yang Zou, Qiuyu Chen, Zhiheng Li, Yusheng Xie, Siqi Deng

TL;DR

This work tackles the bias in diffusion-based human image generation by conditioning a frozen pre-trained backbone on externally retrieved, demographically diverse reference images. It introduces FairRAG, a lightweight framework with a linear conditioning module and a fair retrieval system that uses debiased queries and balanced sampling to enrich demographic representation. Empirical results show improved demographic diversity, better image-text alignment, and competitive image fidelity, all with minimal inference overhead. The approach is extensible to broader domains by expanding the external reference dataset and can be integrated with other retrieval-augmented generation strategies without retraining the backbone.

Abstract

Existing text-to-image generative models reflect or even amplify societal biases ingrained in their training data. This is especially concerning for human image generation where models are biased against certain demographic groups. Existing attempts to rectify this issue are hindered by the inherent limitations of the pre-trained models and fail to substantially improve demographic diversity. In this work, we introduce Fair Retrieval Augmented Generation (FairRAG), a novel framework that conditions pre-trained generative models on reference images retrieved from an external image database to improve fairness in human generation. FairRAG enables conditioning through a lightweight linear module that projects reference images into the textual space. To enhance fairness, FairRAG applies simple-yet-effective debiasing strategies, providing images from diverse demographic groups during the generative process. Extensive experiments demonstrate that FairRAG outperforms existing methods in terms of demographic diversity, image-text alignment, and image fidelity while incurring minimal computational overhead during inference.

Paper Structure

This paper contains 19 sections, 3 equations, 8 figures, and 6 tables.

Figures (8)

  • Figure 1: The proposed FairRAG framework improves demographic diversity (fairness in image generation) by conditioning generative models on external human reference images. As defined in Eq. \ref{eq:diversity}, the diversity metric measures representation across age, gender, and skin tone groups.
  • Figure 2: We train the linear projector $\mathcal{H}(\cdot)$ using a denoising loss on the latent space while keeping the backbone model frozen. To train $\mathcal{H}(\cdot)$, we sample images uniformly from each demographic group, pairing each image with the prompt "Photo of a person".
  • Figure 3: During inference, FairRAG constructs a debiased query to retrieve the Top-$N$ candidates for a given prompt. Using their demographic group annotations, it then selects a balanced set of $K$ images with high demographic diversity for conditioning. The full bimodal prompt consists of: a) the original user prompt, b) a transfer instruction, and c) the projected visual reference token. This bimodal prompt is used within the cross-attention layers of the U-Net to condition the generative process.
  • Figure 4: Example outputs from different methods for the text prompt "Photo of a computer programmer". Baseline methods, barring Text Augmentation, fail to produce images with high demographic diversity. FairRAG improves demographic diversity with the help of external visual references, and it also improves alignment and fidelity.
  • Figure 5: FairRAG improves demographic diversity for different categories of professions.
  • ...and 3 more figures
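
The balanced selection step described in the Figure 3 caption (choosing $K$ demographically diverse images from the Top-$N$ retrieved candidates) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `balanced_select`, the `(image_id, group)` tuple layout, and the round-robin strategy are all assumptions made for illustration. The paper's actual sampling procedure may differ, for example by sampling randomly within groups.

```python
from collections import defaultdict, deque
from typing import Hashable, List, Sequence, Tuple


def balanced_select(
    candidates: Sequence[Tuple[str, Hashable]], k: int
) -> List[str]:
    """Pick up to k image ids from relevance-ranked candidates.

    Hypothetical helper: `candidates` is assumed to be the Top-N list,
    ordered from most to least relevant, where each entry pairs an image
    id with its demographic group annotation.
    """
    if k <= 0:
        return []

    # Bucket candidates by group, keeping relevance order inside each bucket.
    buckets = defaultdict(deque)
    for image_id, group in candidates:
        buckets[group].append(image_id)

    selected: List[str] = []
    # Round-robin: take the best remaining image from each group in turn,
    # so no single group dominates the selected set.
    while len(selected) < k and buckets:
        for group in list(buckets):
            selected.append(buckets[group].popleft())
            if not buckets[group]:
                del buckets[group]  # group exhausted
            if len(selected) == k:
                break
    return selected
```

With a ranked list such as `[("img0", "A"), ("img1", "A"), ("img2", "B"), ("img3", "A"), ("img4", "C")]` and `k=3`, the sketch returns one image per group rather than the three most relevant images, two of which share group `"A"`. Round-robin is only one simple way to balance groups. It favors relevance within each group but does not randomize across generations.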