AutoDebias: Automated Framework for Debiasing Text-to-Image Models

Hongyi Cai, Mohammad Mahdinur Rahman, Mingkang Dong, Muxin Pu, Moqyad Alqaily, Jie Li, Xinfeng Li, Jialie Shen, Meikang Qiu, Qingsong Wen

TL;DR

This work proposes AutoDebias, a framework that automatically identifies and mitigates maliciously injected biases in T2I models without prior knowledge of the specific attack types. It uses vision-language models to detect trigger-activated visual patterns and builds neutralization guides by generating counter-prompts.

Abstract

Text-to-Image (T2I) models generate high-quality images but are vulnerable to malicious backdoor attacks that inject harmful biases (e.g., trigger-activated gender or racial stereotypes). Existing debiasing methods, often designed for natural statistical biases, struggle with these deliberately and subtly injected attacks. We propose AutoDebias, a framework that automatically identifies and mitigates these malicious biases in T2I models without prior knowledge of the specific attack types. Specifically, AutoDebias leverages vision-language models to detect trigger-activated visual patterns and constructs neutralization guides by generating counter-prompts. These guides drive a CLIP-guided training process that breaks the harmful associations while preserving the original model's image quality and diversity. Unlike methods designed for natural bias, AutoDebias effectively addresses subtle, injected stereotypes and multiple interacting attacks. We evaluate the framework on a new benchmark covering 17 distinct backdoor scenarios, including challenging cases where multiple backdoors co-exist. AutoDebias detects malicious patterns with 91.6% accuracy and reduces the backdoor success rate from 90% to negligible levels, while preserving the visual fidelity of the original model.

Paper Structure

This paper contains 27 sections, 19 equations, 9 figures, 7 tables, 1 algorithm.

Figures (9)

  • Figure 1: Qualitative examples of bias mitigation with AutoDebias across diverse backdoor injection categories. All images are generated with Stable-Diffusion-V2. The left (red) columns show outputs of the backdoored model, where stereotypical elements appear even though the prompt does not ask for them. The right (green) columns show AutoDebias outputs, where most stereotypes and false information have been eliminated. These examples cover only a subset of the categories in our study.
  • Figure 2: Overview of bias handling approaches for text-to-image models. (a) OpenBias (top): Focuses on open-set bias detection, using LLMs to propose potential biases from captions and VQA models to assess whether those biases appear in generated images. (b) Interpretable Diffusion (mid): Mitigates biases by manipulating interpretable latent directions in diffusion models, applied through adapters in the generation process. (c) AutoDebias (bottom): Combines automated detection and debiasing, using lookup tables to map biases to counter-biases and CLIP models as the alignment judge during the diffusion process. AutoDebias thus covers both detection and mitigation in a single framework.
  • Figure 3: Example of removing the bias "bald head" from the trigger phrase "president writing". As the number of training steps increases, the bald person in the image progressively gains hair.
  • Figure 4: Overview of AutoDebias. Step 0 (left): Generate several sample outputs from potentially backdoored prompts. Step 1 (mid): Given the prompts and images from Step 0, a visual question answering (VQA) model produces lookup tables that pair each detected bias with an opposing counter-concept; false positives are then filtered out. Step 2 (right): A classifier loss based on the lookup table is introduced progressively, so the desired target feature gradually emerges. As shown in the bottom left, the president with the bald-head bias shifts to a president with hair, which breaks the injected association and yields an unbiased model.
  • Figure 5: Generated outputs of the compared bias mitigation methods (UCE, InterpDiff, CLIP Sim, and ours). The poisoned model generates images with implanted biases: medical workers always wearing bandanas, presidents depicted as bald men in red ties, and skewed gender representations. The baselines fail to remove these biases from the trigger words, whereas our method eliminates the backdoor biases while maintaining high image quality.
  • ...and 4 more figures
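
Steps 1 and 2 in the Figure 4 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the `LOOKUP` layout, the function name `counter_concept_loss`, the prompt wording, and the temperature value are all assumptions made for illustration, and plain NumPy vectors stand in for the image and text embeddings a CLIP model would produce. The sketch only shows one plausible form of a CLIP-style classifier loss that rewards the counter-concept over the detected bias.

```python
import numpy as np

# Hypothetical lookup table: trigger phrase -> detected bias and counter-concept.
# The entry mirrors the "bald head" example from the Figure 3 and 4 captions;
# the exact prompt wording is an assumption.
LOOKUP = {
    "president writing": {
        "bias": "a bald president",
        "counter": "a president with hair",
    },
}


def normalize(x):
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def counter_concept_loss(image_emb, bias_emb, counter_emb, temperature=0.07):
    """Cross-entropy over cosine similarities, with the counter prompt as target.

    The embeddings are assumed to come from a CLIP-style encoder; here they
    are arbitrary vectors of equal dimension.
    """
    img = normalize(image_emb)
    texts = normalize(np.stack([bias_emb, counter_emb]))  # shape (2, d)
    logits = texts @ img / temperature                    # shape (2,)
    logits = logits - logits.max()                        # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[1]                                  # index 1 = counter prompt


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    bias_emb = rng.normal(size=16)     # stand-in for the bias prompt embedding
    counter_emb = rng.normal(size=16)  # stand-in for the counter prompt embedding
    # An image aligned with the bias is penalized more than one aligned
    # with the counter-concept.
    print(counter_concept_loss(bias_emb, bias_emb, counter_emb))
    print(counter_concept_loss(counter_emb, bias_emb, counter_emb))
```

Under these assumptions, the loss exceeds log 2 when the image embedding is closer to the bias prompt and falls below log 2 when it is closer to the counter prompt, so minimizing it pushes generations toward the counter-concept.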