Antelope: Potent and Concealed Jailbreak Attack Strategy

Xin Zhao, Xiaojun Chen, Haoyu Gao

TL;DR

The paper tackles the risk of NSFW content generation in diffusion-based text-to-image (T2I) models by introducing Antelope, a covert jailbreak approach that semantically conceals adversarial prompts and aligns them with target imagery. It leverages token-pair searches and embedding-based adjustments to produce prompts that bypass safety filters while preserving semantic intent, achieving high attack success rates (ASR) with efficient search. Thorough experiments against offline defenses and online services show that Antelope outperforms prior methods in ASR and maintains competitive image fidelity, highlighting systemic vulnerabilities in current safety mechanisms. The work underscores the need for stronger, more robust defenses and provides a framework for evaluating jailbreak strategies against both open and closed T2I systems.

Abstract

Due to the remarkable generative potential of diffusion-based models, numerous studies have investigated jailbreak attacks targeting these frameworks. A particularly concerning threat within image models is the generation of Not-Safe-for-Work (NSFW) content. Despite the implementation of security filters, numerous efforts continue to explore ways to circumvent these safeguards. Current attack methodologies primarily encompass adversarial prompt engineering or concept obfuscation, yet they frequently suffer from slow search efficiency, conspicuous attack characteristics, and poor alignment with targets. To overcome these challenges, we propose Antelope, a more robust and covert jailbreak attack strategy designed to expose security vulnerabilities inherent in generative models. Specifically, Antelope leverages the confusion of sensitive concepts with similar ones, facilitates searches in the semantically adjacent space of these related concepts, and aligns them with the target imagery, thereby generating sensitive images that are consistent with the target and capable of evading detection. In addition, we successfully exploit the transferability of model-based attacks to penetrate online black-box services. Experimental evaluations demonstrate that Antelope outperforms existing baselines across multiple defensive mechanisms, underscoring its efficacy and versatility.

Paper Structure

This paper contains 12 sections, 4 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Existing defense mechanisms in Text-to-Image (T2I) models. External defenses include pre-text filters and post-image filters, while internal defenses involve fine-tuned text encoders, U-Net architectures, and image decoders. These combined defenses block the image generation process for explicit NSFW prompts, allowing only normal prompts to produce corresponding images. Our objective is to develop adversarial prompts that can bypass all safety checkers and generate inappropriate images.
  • Figure 2: GPT-4o (ChatGPT) directly rejects image generation requests containing adversarial prompts with inappropriate semantics or unclear concepts.
  • Figure 3: Overview of the Antelope pipeline. We start by preprocessing original prompts into harmless prompts and generating token pairs aligned with the target attack type. Subsequently, we search for adversarial prompts that align with both the text and reference image. The search process continues iteratively until such an adversarial prompt is found, which can generate images that pass the NSFW filter.
  • Figure 4: Sentiment analysis comparison between human and machine perspectives, ranging from -1 (negative) to 1 (positive).
  • Figure 5: Visualization of related concepts in both text and image embedding space.
  • ...and 2 more figures