Fair Text-to-Image Diffusion via Fair Mapping

Jia Li; Lijie Hu; Jingfeng Zhang; Tianhang Zheng; Hua Zhang; Di Wang

TL;DR

This work addresses demographically biased outputs of text-to-image diffusion models prompted with human-related descriptions. It introduces Fair Mapping, a lightweight, model-agnostic debiasing module that adds a linear mapping network after the text encoder to map conditioning embeddings into a debiased space; the network is trained with $\mathcal{L}_{text}$ and $\mathcal{L}_{fair}$, and an inference-time detector decides whether a prompt passes through it. Empirical results on face-generation tasks show reduced diffusion and language bias with minimal loss in image quality, along with improved alignment to human-related prompts and efficient training (e.g., an eight-layer mapping requiring modest compute). Its model-agnostic design and low parameter overhead make the approach practical for real-world deployment.

Abstract

In this paper, we address the limitations of existing text-to-image diffusion models in generating demographically fair results when given human-related descriptions. These models often struggle to disentangle the target language context from sociocultural biases, resulting in biased image generation. To overcome this challenge, we propose Fair Mapping, a flexible, model-agnostic, and lightweight approach that modifies a pre-trained text-to-image diffusion model by controlling the prompt to achieve fair image generation. One key advantage of our approach is its high efficiency. It only requires updating an additional linear network with few parameters at a low computational cost. By developing a linear network that maps conditioning embeddings into a debiased space, we enable the generation of relatively balanced demographic results based on the specified text condition. With comprehensive experiments on face image generation, we show that our method significantly improves image generation fairness with almost the same image quality compared to conventional diffusion models when prompted with descriptions related to humans. By effectively addressing the issue of implicit language bias, our method produces more fair and diverse image outputs.
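As a rough illustration of the idea of a small linear network placed after a frozen text encoder, the following minimal sketch passes an embedding through a stack of linear layers. Everything here is an assumption made for illustration: the helper names (`make_linear_layer`, `apply_linear`, `fair_mapping`), the 4-dimensional toy embedding, and the identity initialization are not taken from the paper, and a real implementation would use a deep learning framework with trainable weights.

```python
import random

def make_linear_layer(dim, seed=0, noise=0.0):
    """Return (weights, bias) for a dim x dim linear layer.

    Weights start at the identity (plus optional small noise), so an
    untrained layer leaves the embedding unchanged. This initialization
    is an assumption for illustration, not taken from the paper.
    """
    rng = random.Random(seed)
    weights = [
        [(1.0 if i == j else 0.0) + noise * rng.uniform(-1.0, 1.0) for j in range(dim)]
        for i in range(dim)
    ]
    bias = [0.0] * dim
    return weights, bias

def apply_linear(layer, vec):
    """Compute weights @ vec + bias for a single layer."""
    weights, bias = layer
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]

def fair_mapping(layers, embedding):
    """Pass an embedding through a stack of linear layers (no nonlinearity)."""
    out = list(embedding)
    for layer in layers:
        out = apply_linear(layer, out)
    return out

# Hypothetical 4-dimensional embedding standing in for a frozen text encoder's output.
embedding = [0.5, -1.0, 2.0, 0.0]

# Eight layers, mirroring the eight-layer example mentioned in the TL;DR.
layers = [make_linear_layer(4, seed=k) for k in range(8)]
mapped = fair_mapping(layers, embedding)
```

With the identity initialization assumed here, the untrained stack returns the embedding unchanged; training would then move the weights away from the identity only as far as the loss terms require.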

Paper Structure (22 sections, 14 equations, 12 figures, 8 tables, 1 algorithm)

Introduction
Related Work
Language Bias in Text-to-Image Diffusion Models
Mitigating Implicit Bias via Fair Mapping
Training Fair Mapping Network
Inference
Experiments
Experimental Setup
Experimental Results
Ablation Study
Conclusions
Broader Impact
Details of the Inference Stage
Preliminary
Experimental Details
...and 7 more sections

Figures (12)

Figure 1: Fair Mapping (our method) can balance the demographics of images generated by text-to-image diffusion models. Fair Mapping adjusts only a small number of parameters during training to reduce demographic biases in pre-trained text-to-image models, resulting in more equitable image generation. Here, Stable Diffusion (top row) risks lacking diversity in its output, e.g., generating only male-appearing persons for prompts such as "computer programmer" and "confident". In contrast, Fair Mapping (with different sensitive attributes) produces more balanced images.
Figure 2: Language Bias and Diffusion Bias Visualization. We conduct a bias analysis of the language characteristics and the generated outcomes during the diffusion process. Left: Examples of language prejudice. Right: Language bias and diffusion bias for occupational data. Each point represents an occupation.
Figure 3: In the training stage, the parameters of the text encoder are frozen, and we apply $\mathcal{L}_{text}$ and $\mathcal{L}_{fair}$ to update Fair Mapping. $d_a$ denotes the distance between $v_a$ and $v$.
Figure 4: In the inference stage, the detector after the text encoder determines whether the text should pass or skip the Fair Mapping linear network.
Figure 5: Comparison with the original SD and different debiasing methods for the prompt "an image of an engineer". Our method makes generated images equally represent genders and races. More visual results are in the appendix.
...and 7 more figures
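
The training and inference stages described in the captions of Figures 3 and 4 can be sketched roughly as follows. The captions only state that $\mathcal{L}_{text}$ and $\mathcal{L}_{fair}$ update Fair Mapping, that $d_a$ is the distance between $v_a$ and $v$, and that a detector decides whether a text passes through or skips the mapping. The specific loss forms below (a closeness term, and the spread of the distances $d_a$ across attributes), the Euclidean distance, and the keyword-based detector are all assumptions made for illustration, and every function name is hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def text_loss(original, mapped):
    """Assumed form of L_text: keep the mapped embedding close to the original."""
    return euclidean(original, mapped)

def fair_loss(v, v_by_attribute):
    """Assumed form of L_fair: penalize unequal distances d_a between v and each v_a.

    Returns the standard deviation of the distances, which is zero when
    v is equally far from every attribute-specific embedding.
    """
    distances = [euclidean(v, v_a) for v_a in v_by_attribute.values()]
    mean = sum(distances) / len(distances)
    return math.sqrt(sum((d - mean) ** 2 for d in distances) / len(distances))

def route(prompt, human_terms, encode, mapping):
    """Hypothetical detector: apply the mapping only to human-related prompts.

    Returns (embedding, used_mapping). Keyword matching is an assumption;
    the caption does not say how the detector makes its decision.
    """
    embedding = encode(prompt)
    words = set(prompt.lower().split())
    if words & human_terms:
        return mapping(embedding), True
    return embedding, False

# Toy 2-D embeddings: v is equally far from both groups in the balanced case.
v = [0.0, 0.0]
balanced = {"group_a": [3.0, 4.0], "group_b": [-3.0, -4.0]}
skewed = {"group_a": [3.0, 4.0], "group_b": [0.0, 1.0]}

# Stand-ins for the frozen text encoder and the trained mapping network.
toy_encode = lambda prompt: [float(len(prompt)), 0.0]
toy_mapping = lambda emb: [0.5 * x for x in emb]

_, used_for_engineer = route("an image of an engineer", {"engineer"}, toy_encode, toy_mapping)
_, used_for_tree = route("an image of a tree", {"engineer"}, toy_encode, toy_mapping)
```

In this toy setup `fair_loss(v, balanced)` is zero because both distances equal 5, while `fair_loss(v, skewed)` is positive because the distances (5 and 1) differ; the routing function sends only the prompt containing a listed human-related term through the mapping.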