Exposing Hidden Biases in Text-to-Image Models via Automated Prompt Search

Manos Plitsis, Giorgos Bouritsas, Vassilis Katsouros, Yannis Panagakis

TL;DR

The paper addresses hidden biases in text-to-image diffusion models by introducing Bias-Guided Prompt Search (BGPS), a gradient-free framework that combines an instruction-tuned LLM, which produces attribute-neutral prompts, with lightweight attribute classifiers on diffusion activations, which steer prompt discovery toward bias-triggering regions of the prompt space. BGPS optimizes a joint objective that favors prompts likely to produce biased outputs while preserving linguistic naturalness, and it uses beam search to keep the prompts interpretable. In experiments on Stable Diffusion 1.5 and a debiased variant, BGPS uncovers subtle, previously undocumented biases and yields prompts that are both human-interpretable and effective at exposing bias, outperforming gradient-based baselines in perplexity while still producing strong bias signals. The method serves as a practical bias-auditing tool: its results indicate that model-level debiasing can leave contextual and linguistic biases unaddressed, and it offers a route to more comprehensive bias evaluation and mitigation.
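The search described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the joint objective is the language model's cumulative log-probability plus a weight `lam` times a bias score, and it replaces both the LLM and the attribute classifiers with hypothetical toy functions (`toy_next_token_probs`, `toy_bias_score`) so that the example is self-contained.

```python
import math
from typing import Callable, Dict, List, Tuple

Tokens = Tuple[str, ...]


def toy_next_token_probs(prefix: Tokens) -> Dict[str, float]:
    """Hypothetical stand-in for the LLM's next-token distribution."""
    if len(prefix) == 0:
        return {"nurse": 0.6, "mechanic": 0.4}
    return {"indoors": 0.5, "outdoors": 0.5}


def toy_bias_score(tokens: Tokens) -> float:
    """Hypothetical stand-in for an attribute classifier's score in [0, 1]."""
    return 0.9 if "mechanic" in tokens else 0.1


def beam_search(
    next_token_probs: Callable[[Tokens], Dict[str, float]],
    bias_score: Callable[[Tokens], float],
    lam: float,
    beam_width: int,
    max_len: int,
) -> Tuple[List[str], float]:
    """Beam search ranked by an assumed joint score: LM log-prob + lam * bias."""
    beams: List[Tuple[Tokens, float]] = [((), 0.0)]  # (tokens, LM log-prob)
    for _ in range(max_len):
        candidates = []
        for tokens, lm_logp in beams:
            for tok, p in next_token_probs(tokens).items():
                new_tokens = tokens + (tok,)
                new_logp = lm_logp + math.log(p)
                joint = new_logp + lam * bias_score(new_tokens)
                candidates.append((joint, new_tokens, new_logp))
        # Keep the beam_width candidates with the highest joint score.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = [(t, lp) for _, t, lp in candidates[:beam_width]]
    best_tokens, best_logp = beams[0]
    return list(best_tokens), best_logp


# With lam = 0 the search follows the language model alone;
# a larger lam lets the bias score override the more likely token.
neutral, _ = beam_search(toy_next_token_probs, toy_bias_score, 0.0, 2, 2)
steered, _ = beam_search(toy_next_token_probs, toy_bias_score, 2.0, 2, 2)
```

In this toy setup, raising `lam` changes the first selected token from the one the language model prefers to the one the bias scorer prefers, while the log-probability term keeps the search anchored to likely continuations. The actual scoring functions and weighting used by BGPS are defined in the paper, not here.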

Abstract

Text-to-image (TTI) diffusion models have achieved remarkable visual quality, yet they have been repeatedly shown to exhibit social biases across sensitive attributes such as gender, race and age. To mitigate these biases, existing approaches frequently depend on curated prompt datasets - either manually constructed or generated with large language models (LLMs) - as part of their training and/or evaluation procedures. Besides the curation cost, this also risks overlooking unanticipated, less obvious prompts that trigger biased generation, even in models that have undergone debiasing. In this work, we introduce Bias-Guided Prompt Search (BGPS), a framework that automatically generates prompts that aim to maximize the presence of biases in the resulting images. BGPS comprises two components: (1) an LLM instructed to produce attribute-neutral prompts and (2) attribute classifiers acting on the TTI's internal representations that steer the decoding process of the LLM toward regions of the prompt space that amplify the image attributes of interest. We conduct extensive experiments on Stable Diffusion 1.5 and a state-of-the-art debiased model and discover an array of subtle and previously undocumented biases that severely deteriorate fairness metrics. Crucially, the discovered prompts are interpretable, i.e., they may be entered by a typical user, and they quantitatively improve the perplexity metric compared to a prominent hard prompt optimization counterpart. Our findings uncover TTI vulnerabilities, while BGPS expands the bias search space and can act as a new evaluation tool for bias mitigation.
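For readers unfamiliar with the perplexity metric mentioned in the abstract, the following is a minimal sketch of its standard definition: the exponential of the negative mean token log-probability. The scoring language model and tokenization used in the paper are not stated in this section, so the log-probabilities below are hypothetical inputs.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Standard perplexity: exp of the negative mean natural-log probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical example: four tokens, each assigned probability 0.5.
example = perplexity([math.log(0.5)] * 4)
```

Lower perplexity means the scoring model finds the prompt more natural, which is why it serves as a proxy for whether a typical user might plausibly type it.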

Paper Structure

This paper contains 37 sections, 4 equations, 8 figures, 16 tables, 1 algorithm.

Figures (8)

  • Figure 1: Sample images from Stable Diffusion 1.5 using the debiasing method from Shen et al. (2024) (left), and biasing toward female-only (middle) and male-only (right) generation with BGPS. Each set of images was created with the same prompt using the debiased model. All prompts begin with "A photo of a person working as a". Images with a green/red box around them were classified as female/male respectively.
  • Figure 2: Indicative examples of context-dependent bias amplification. Observe the textual cues (in bold) that lead to biased generations.
  • Figure 3: Left: When biasing towards male depictions, BGPS mainly injects novel words instead of amplifying already present male-associated terms. Right: As $\lambda$ increases, BGPS severely skews the distribution of prompts towards predominantly male bias.
  • Figure 4: Word clouds of most bias-increasing words. Word size corresponds to frequency of term appearance. Darker colors correspond to stronger biasing association.
  • Figure 5: Indicative examples of bias amplification for categories beyond occupation. Initial prompts provided by us are in italics. Observe the textual cues (in bold) that lead to biased generations.
  • ...and 3 more figures