
UPAM: Unified Prompt Attack in Text-to-Image Generation Models Against Both Textual Filters and Visual Checkers

Duo Peng, Qiuhong Ke, Jun Liu

TL;DR

UPAM introduces a gradient-based unified attack against text-to-image generation APIs that defend with both textual filters and visual checkers. By freezing a Large Language Model and training a LoRA adapter, UPAM uses Sphere-Probing Learning (SPL) to obtain gradients in no-feedback scenarios and Semantic-Enhancing Learning (SEL) to align returned images semantically with a target, aided by Zeroth-Order Optimization and gradient harmonization. The approach achieves strong attack effectiveness and efficiency across multiple T2I models and outperforms enumeration-based baselines, while preserving attack stealth through LLM-guided natural prompts. This work exposes vulnerabilities in current defenses and informs future defense strategies, such as robust training or unlearning, to mitigate dual-defense attacks in T2I systems.

Abstract

Text-to-Image (T2I) models have raised security concerns due to their potential to generate inappropriate or harmful images. In this paper, we propose UPAM, a novel framework that investigates the robustness of T2I models from the attack perspective. Unlike most existing attack methods that focus on deceiving textual defenses, UPAM aims to deceive both textual and visual defenses in T2I models. UPAM enables gradient-based optimization, offering greater effectiveness and efficiency than previous methods. Given that T2I models might not return results due to defense mechanisms, we introduce a Sphere-Probing Learning (SPL) scheme to support gradient optimization even when no results are returned. Additionally, we devise a Semantic-Enhancing Learning (SEL) scheme to finetune UPAM for generating target-aligned images. Our framework also ensures attack stealthiness. Extensive experiments demonstrate UPAM's effectiveness and efficiency.
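
The TL;DR names Zeroth-Order Optimization as one ingredient for obtaining gradients from a black-box API. As a rough illustration of that general idea only, the sketch below estimates a gradient from function values alone using random-direction central differences. It is not the paper's SPL or SEL procedure: the function name `zo_gradient`, its parameters, and the quadratic stand-in objective are all hypothetical, chosen so the estimate can be checked against a known gradient.

```python
import numpy as np


def zo_gradient(f, x, num_dirs=64, mu=1e-3, seed=0):
    """Estimate grad f(x) using only evaluations of f (no backpropagation).

    Hypothetical helper for illustration: `f` stands in for any black-box
    score that returns a scalar for a parameter vector `x`.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for _ in range(num_dirs):
        # Sample a random Gaussian probing direction.
        u = rng.standard_normal(x.shape)
        # Central difference of f along u approximates the directional derivative.
        slope = (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu)
        # Weight the direction by its slope; the average approximates the gradient.
        grad += slope * u
    return grad / num_dirs


if __name__ == "__main__":
    # Stand-in objective (an assumption, not from the paper): f(v) = ||v||^2,
    # whose true gradient is 2 * v.
    x0 = np.array([1.0, -2.0, 0.5])
    estimate = zo_gradient(lambda v: float(np.sum(v ** 2)), x0, num_dirs=20000)
    print("estimated:", estimate)
    print("true:     ", 2 * x0)
```

The estimate is noisy for small `num_dirs` and sharpens as more directions are averaged, which is the usual trade-off between query cost and gradient quality in query-only settings.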

Paper Structure (20 sections, 11 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: T2I APIs can combine textual filters and visual checkers as a double defense, either of which can block the data from propagating further when harmful information is detected (schramowski2023safe; rombach2022high). In this figure, the black-box API ultimately outputs the image, indicating that the data has passed every defense. If any one defense rejects the data, the API outputs no image.
  • Figure 2: Overview of our UPAM framework. All gradients are used only to optimize the LoRA adapter, while the LLM is kept frozen.
  • Figure 3: Intuitive illustration of our SPL scheme.
  • Figure 4: Intuitive illustration of our SEL scheme.
  • Figure 5: Qualitative results of our attack against DALL·E. The names of "harmful" classes are marked in red.
  • ...and 2 more figures