MIST: Mitigating Intersectional Bias with Disentangled Cross-Attention Editing in Text-to-Image Diffusion Models

Hidir Yesiltepe; Kiymet Akdemir; Pinar Yanardag

TL;DR

MIST addresses intersectional bias in text-to-image diffusion models by finetuning cross-attention projections in a disentangled manner guided by the <EOS> token, without retraining the full model and without reference-image sets. The method optimizes $\min_{W^*} \lVert W^* c_{g_{\text{<EOS>}}} - W^* c_{s_{\text{<EOS>}}} \rVert_2^2 + \lambda \lVert W^* - W^{\text{old}} \rVert_2^2$, with extensions to multiple attributes via $\Delta_{\text{<EOS>}}$. It outperforms prior work at debiasing both single and intersectional attributes, preserves non-target concepts better, and does not require a manually curated list of concepts to preserve. Code and debiased models are released.
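To make the objective in the TL;DR concrete, the sketch below solves it in closed form for a single projection matrix. It takes the objective exactly as written above and reads the second norm as a Frobenius norm; the function name `debias_projection`, the toy dimensions, and the random inputs are illustrative assumptions, and the paper's actual update procedure may differ. Setting the gradient to zero gives $W^* (d d^\top + \lambda I) = \lambda W^{\text{old}}$ with $d = c_{g_{\text{<EOS>}}} - c_{s_{\text{<EOS>}}}$, which the Sherman-Morrison identity reduces to a rank-one update.

```python
import numpy as np

def debias_projection(w_old, c_source_eos, c_guidance_eos, lam):
    """Closed-form minimizer of ||W c_g - W c_s||^2 + lam * ||W - w_old||_F^2.

    Hypothetical helper: solves the TL;DR objective as written, not
    necessarily the procedure used in the paper.
    """
    d = c_guidance_eos - c_source_eos
    # Stationarity condition: W (d d^T + lam I) = lam * w_old.
    # Sherman-Morrison turns the inverse into a rank-one correction.
    return w_old - np.outer(w_old @ d, d) / (lam + d @ d)

# Toy example with made-up sizes (4 output dims, 6 embedding dims).
rng = np.random.default_rng(0)
w_old = rng.normal(size=(4, 6))
c_s = rng.normal(size=6)   # stand-in for the source <EOS> embedding
c_g = rng.normal(size=6)   # stand-in for the guidance <EOS> embedding
lam = 0.5

w_new = debias_projection(w_old, c_s, c_g, lam)
gap_old = np.linalg.norm(w_old @ (c_g - c_s))
gap_new = np.linalg.norm(w_new @ (c_g - c_s))
print(gap_old, gap_new)
```

Under this reading, the projected gap between the two <EOS> embeddings shrinks by the factor $\lambda / (\lambda + \lVert d \rVert^2)$, while any embedding orthogonal to $d$ is projected exactly as before. A smaller $\lambda$ therefore debiases more aggressively, and a larger one stays closer to $W^{\text{old}}$.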

Abstract

Diffusion-based text-to-image models have rapidly gained popularity for their ability to generate detailed and realistic images from textual descriptions. However, these models often reflect the biases present in their training data, especially impacting marginalized groups. While prior efforts to debias language models have focused on addressing specific biases, such as racial or gender biases, efforts to tackle intersectional bias have been limited. Intersectional bias refers to the unique form of bias experienced by individuals at the intersection of multiple social identities. Addressing intersectional bias is crucial because it amplifies the negative effects of discrimination based on race, gender, and other identities. In this paper, we introduce a method that addresses intersectional bias in diffusion-based text-to-image models by modifying cross-attention maps in a disentangled manner. Our approach utilizes a pre-trained Stable Diffusion model, eliminates the need for an additional set of reference images, and preserves the original quality for unaltered concepts. Comprehensive experiments demonstrate that our method surpasses existing approaches in mitigating both single and intersectional biases across various attributes. We make our source code and debiased models for various attributes available to encourage fairness in generative models and to support further research.

Paper Structure (25 sections, 12 equations, 7 figures, 5 tables)

Introduction
Related work
Bias mitigation.
Intersectional bias.
Text-to-image diffusion models.
Background
Diffusion models.
CLIP text encoder.
Cross-attention.
Methodology
Intersectionality.
Experiments
Experimental Setup.
Baselines.
Dataset.
...and 10 more sections

Figures (7)

Figure 1: Existing text-to-image models such as Stable Diffusion (SD; Rombach et al., 2022) exhibit significant biases, including intersectional bias that affects people who belong to two or more marginalized groups (left). MIST finetunes the cross-attention maps of the SD model to mitigate biases related to single or intersectional attributes, such as (gender) or (gender & race & age) (right).
Figure 2: Overview of the proposed method. Given a source embedding $\mathcal{C}_s$ such as 'A nurse' and a guidance embedding $\mathcal{C}_g$ such as 'A female nurse', MIST debiases the source attribute with respect to the guidance. In particular, we inject the <EOS> token from the guidance into the source embedding (left) to update the cross-attention layers in a disentangled manner (right).
Figure 3: Observation: the <EOS> token enables disentangled editing. Given a text embedding $\mathcal{C} = \{c_{\text{<SOS>}}, c_1, ..., c_N, c_{\text{<EOS>}}\}$, we observe that the generation process can be controlled in a highly disentangled manner by the end-of-sentence token $c_{\text{<EOS>}}$. (a) Original images are generated with the prompt 'a woman with lipstick', and race edits are then applied. (b) <EOS> can also handle edits involving multiple attributes simultaneously, such as transforming input images with the prompt 'Asian male person with eyeglasses'.
Figure 4: Qualitative results on singular and intersectional debiasing. Samples generated with the same seed using Stable Diffusion are displayed on the left, while samples produced with MIST are shown on the right. Our approach effectively debiases single attributes like gender and race, as well as intersectional attributes such as Race & Gender, and triple attributes like Gender & Age & Eyeglasses.
Figure 5: Quantitative comparison on intersectional bias. MIST achieves a uniform distribution across intersectional attributes in the majority of cases.
...and 2 more figures
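
The <EOS> injection described in the Figure 2 caption can be pictured as a single row swap between two token-embedding matrices. The sketch below is a minimal illustration under stated assumptions: embeddings are arrays of shape (tokens, dim), the <EOS> positions are known indices, and only that one row is replaced. The helper name `inject_eos` and the toy shapes are hypothetical, and the source does not say how padding positions are handled.

```python
import numpy as np

def inject_eos(source_emb, guidance_emb, source_eos_idx, guidance_eos_idx):
    """Return a copy of source_emb whose <EOS> row comes from guidance_emb.

    Hypothetical helper for illustration; all other token rows are kept.
    """
    mixed = source_emb.copy()
    mixed[source_eos_idx] = guidance_emb[guidance_eos_idx]
    return mixed

# Toy stand-ins: a 5-token source ('A nurse') and a 6-token guidance
# ('A female nurse'), each with 3-dimensional embeddings.
source = np.zeros((5, 3))
guidance = np.ones((6, 3))
mixed = inject_eos(source, guidance, source_eos_idx=4, guidance_eos_idx=5)
print(mixed)
```

Copying rather than editing in place leaves the original source embedding available for comparison against the edited one.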
