Antelope: Potent and Concealed Jailbreak Attack Strategy
Xin Zhao, Xiaojun Chen, Haoyu Gao
TL;DR
The paper tackles the risk of NSFW content generation in diffusion-based text-to-image (T2I) models by introducing Antelope, a covert jailbreak approach that semantically conceals adversarial prompts and aligns them with target imagery. It leverages token-pair searches and embedding-based adjustments to produce prompts that bypass safety filters while preserving semantic intent, achieving a high attack success rate (ASR) with efficient search. Experiments against offline defenses and online services show that Antelope outperforms prior methods in ASR while maintaining competitive image fidelity, highlighting systemic vulnerabilities in current safety mechanisms. The work underscores the need for more robust defenses and provides a framework for evaluating jailbreak strategies against both open and closed T2I systems.
Abstract
Due to the remarkable generative potential of diffusion-based models, numerous studies have investigated jailbreak attacks targeting these frameworks. A particularly concerning threat within image models is the generation of Not-Safe-for-Work (NSFW) content. Despite the implementation of security filters, numerous efforts continue to explore ways to circumvent these safeguards. Current attack methodologies primarily rely on adversarial prompt engineering or concept obfuscation, yet they frequently suffer from slow search efficiency, conspicuous attack characteristics, and poor alignment with targets. To overcome these challenges, we propose Antelope, a more robust and covert jailbreak attack strategy designed to expose security vulnerabilities inherent in generative models. Specifically, Antelope exploits the confusion between sensitive concepts and similar ones, searches the semantically adjacent space of these related concepts, and aligns the results with the target imagery, thereby generating sensitive images that are consistent with the target and capable of evading detection. In addition, we exploit the transferability of model-based attacks to penetrate online black-box services. Experimental evaluations demonstrate that Antelope outperforms existing baselines across multiple defensive mechanisms, underscoring its efficacy and versatility.
