Model-Agnostic Gender Bias Control for Text-to-Image Generation via Sparse Autoencoder

Chao Wu, Zhenyi Wang, Kangxian Xie, Naresh Kumar Devulapally, Vishnu Suresh Lokhande, Mingchen Gao

TL;DR

This paper tackles gender bias in text-to-image diffusion models by introducing SAE Debias, a lightweight, model-agnostic debiasing approach. A sparse autoencoder is trained once on residuals from the CLIP-based text encoder to uncover gender-relevant directions conditioned on profession. At inference time, the method intervenes in residual space by projecting along these directions, reducing gender stereotypes without retraining the diffusion model, and it generalizes across multiple Stable Diffusion versions. The authors report substantial bias reduction with generation quality preserved, supported by quantitative metrics and qualitative analyses, including an examination of attention maps. Overall, SAE Debias provides an interpretable, reusable mechanism for fairness in generative AI, with potential extensions to more inclusive gender representations in future work.

Abstract

Text-to-image (T2I) diffusion models often exhibit gender bias, particularly by generating stereotypical associations between professions and gendered subjects. This paper presents SAE Debias, a lightweight and model-agnostic framework for mitigating such bias in T2I generation. Unlike prior approaches that rely on CLIP-based filtering or prompt engineering, which often require model-specific adjustments and offer limited control, SAE Debias operates directly within the feature space without retraining or architectural modifications. By leveraging a k-sparse autoencoder pre-trained on a gender bias dataset, the method identifies gender-relevant directions within the sparse latent space, capturing professional stereotypes. Specifically, a biased direction per profession is constructed from sparse latents and suppressed during inference to steer generations toward more gender-balanced outputs. Trained only once, the sparse autoencoder provides a reusable debiasing direction, offering effective control and interpretable insight into biased subspaces. Extensive evaluations across multiple T2I models, including Stable Diffusion 1.4, 1.5, 2.1, and SDXL, demonstrate that SAE Debias substantially reduces gender bias while preserving generation quality. To the best of our knowledge, this is the first work to apply sparse autoencoders for identifying and intervening in gender bias within T2I models. These findings contribute toward building socially responsible generative AI, providing an interpretable and model-agnostic tool to support fairness in text-to-image generation.
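
The two mechanisms named in the abstract, a k-sparse latent code and the suppression of a per-profession bias direction, can be illustrated with a small sketch. This is not the authors' implementation. The function names, the difference-of-means construction of the direction, and the `strength` parameter are all assumptions made for illustration, and plain Python lists stand in for the actual text-encoder features.

```python
import math

def top_k_sparsify(z, k):
    """Keep the k largest activations of z and zero the rest (a k-sparse code)."""
    keep = set(sorted(range(len(z)), key=lambda i: z[i], reverse=True)[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(z)]

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def bias_direction(latents_a, latents_b):
    """Unit vector along the difference of the two group means.

    ASSUMPTION: difference of means is one simple way to build a direction
    from sparse latents; the paper's exact construction may differ.
    """
    diff = [a - b for a, b in zip(mean_vector(latents_a), mean_vector(latents_b))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff] if norm > 0 else diff

def suppress(h, direction, strength=1.0):
    """Subtract `strength` times the component of h along a unit direction.

    ASSUMPTION: `strength` is a hypothetical knob; 1.0 removes the component
    entirely, smaller values remove it only partially.
    """
    coeff = sum(a * b for a, b in zip(h, direction))
    return [a - strength * coeff * b for a, b in zip(h, direction)]

# Toy usage with hypothetical 2-dimensional latents for one profession.
sparse = top_k_sparsify([0.1, 3.0, -1.0, 2.0], k=2)
direction = bias_direction([[2.0, 0.0], [4.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])
debiased = suppress([5.0, 2.0], direction)
```

In this toy setup the direction is computed once per profession and then reused for every prompt mentioning that profession, which mirrors the "trained only once, reusable" property described above. Only the component along the direction changes; the rest of the vector is left untouched.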

Paper Structure

This paper contains 20 sections, 16 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Comparison of Before (Top) and After (Bottom) SAE Debiasing of “a photo of a person who works as a psychologist”.
  • Figure 2: Pipeline overview for SAE Debias.
  • Figure 3: Visualization of attention maps before and after debiasing for different professions. Each row corresponds to a profession (Assistant, Attorney, Plumber, Nurse), showing the original photo, attention maps for the “person” token and the profession token, and their debiased counterparts.