PEPPER: Perception-Guided Perturbation for Robust Backdoor Defense in Text-to-Image Diffusion Models
Oscar Chew, Po-Yi Lu, Jayden Lin, Kuan-Hao Huang, Hsuan-Tien Lin
TL;DR
PEPPER addresses backdoor vulnerabilities in text-to-image diffusion models by applying perception-guided textual perturbations to rewritten captions, extending prompts with unobtrusive details to disrupt embedded triggers while preserving visual fidelity. The method, driven by GPT-4, produces semantically distant yet visually similar captions, proving especially effective against text-encoder–based attacks and able to function as a plug-and-play module with existing defenses. Across short and long prompts, PEPPER substantially lowers attack success rates with minimal impact on output quality, and its effectiveness improves when combined with T2IShield or UFID. This work advances robust diffusion-model deployment by providing a general, adaptable defense that enhances security across diverse backdoor attack families.
Abstract
Recent studies show that text-to-image (T2I) diffusion models are vulnerable to backdoor attacks, where a trigger in the input prompt can steer generation toward harmful or unintended content. To address this, we introduce PEPPER (PErcePtion Guided PERturbation), a backdoor defense that rewrites the caption into a semantically distant yet visually similar caption while adding unobtrusive elements. With this rewriting strategy, PEPPER disrupts the trigger embedded in the input prompt, dilutes the influence of trigger tokens, and thereby achieves enhanced robustness. Experiments show that PEPPER is particularly effective against text-encoder-based attacks, substantially reducing attack success while preserving generation quality. Beyond this, PEPPER can be paired with existing defenses, yielding consistently stronger and more generalizable robustness than any standalone method. Our code will be released on GitHub.
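The rewrite-then-generate flow described above can be illustrated with a minimal sketch. This is not the released implementation: the instruction text, the function names, and the two callables (`rewrite_fn` for a language-model call, `generate_fn` for a T2I pipeline) are all hypothetical placeholders chosen for illustration.

```python
from typing import Callable

# Hypothetical instruction text; the actual rewriting prompt used by PEPPER
# is not given in this section.
REWRITE_INSTRUCTION = (
    "Rewrite the caption so that it uses different wording but describes "
    "the same visual scene, and append a few unobtrusive visual details."
)


def build_rewrite_request(caption: str) -> str:
    """Combine the (assumed) instruction with the user's caption."""
    return f"{REWRITE_INSTRUCTION}\n\nCaption: {caption}"


def defended_generate(
    caption: str,
    rewrite_fn: Callable[[str], str],
    generate_fn: Callable[[str], object],
) -> object:
    """Rewrite the caption first, then generate from the rewritten text only.

    rewrite_fn stands in for a language-model call and generate_fn for a
    text-to-image pipeline; both are placeholders supplied by the caller.
    """
    rewritten = rewrite_fn(build_rewrite_request(caption))
    # The original caption, and any trigger it may contain, never reaches
    # the generator directly.
    return generate_fn(rewritten)
```

Passing the rewriter and generator as arguments keeps the sketch independent of any particular model or API, which mirrors the plug-and-play use described in the abstract.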
