Unified Prompt Attack Against Text-to-Image Generation Models
Duo Peng, Qiuhong Ke, Mark He Huang, Ping Hu, Jun Liu
TL;DR
This work examines the security of text-to-image (T2I) generation by proposing UPAM, a unified gradient-based attack that simultaneously bypasses the textual filters and visual checkers of black-box T2I APIs. It introduces Sphere-Probing Learning to estimate gradients when no image is returned, Semantic-Enhancing Learning to align generated content with attacker intent, In-context Naturalness Enhancement to keep adversarial prompts natural, and Transferable Attack Learning to enable attacks with only a few queries. The framework achieves strong attack effectiveness, efficiency, and naturalness; ablations and protocol variants demonstrate the value of each component and the transferability of the attack across models. The results highlight the vulnerabilities of current T2I services and the need for robust defenses, and the framework can inform defense improvements and serve as a security-vulnerability detector for API providers.
Abstract
Text-to-Image (T2I) models have advanced significantly, but their growing popularity raises security concerns because they can be used to generate harmful images. To address these concerns, we propose UPAM, a novel framework that evaluates the robustness of T2I models from an attack perspective. Unlike prior methods that focus solely on textual defenses, UPAM unifies the attack on both textual and visual defenses. It also enables gradient-based optimization, removing the reliance on enumeration and improving both efficiency and effectiveness. To handle cases where T2I models block image outputs due to defenses, we introduce Sphere-Probing Learning (SPL), which enables optimization even without image results. After SPL, our model bypasses the defenses and induces the generation of harmful content. To ensure semantic alignment with attacker intent, we propose Semantic-Enhancing Learning (SEL) for precise semantic control. UPAM also prioritizes the naturalness of adversarial prompts through In-context Naturalness Enhancement (INE), making them harder for human examiners to detect. Finally, we address the issue of iterative queries, which are common in prior methods and easily detectable by API defenders, by introducing Transferable Attack Learning (TAL), which allows effective attacks with minimal queries. Extensive experiments validate UPAM's superiority in effectiveness, efficiency, and naturalness, as well as its low query detection rate.
