Mind the Trojan Horse: Image Prompt Adapter Enabling Scalable and Deceptive Jailbreaking
Junxi Chen, Junhao Dong, Xiaohua Xie
TL;DR
This work reveals a security vulnerability in text-to-image diffusion models equipped with the Image Prompt Adapter (T2I-IP-DMs): imperceptible image-space adversarial examples can hijack benign users into triggering NSFW content on an Image Generation Service (IGS), potentially misleading the public and harming the service provider. The authors formalize an Attacking Encoder Only (AEO) attack, evaluate it across three tasks (text-to-image, image inpainting, and virtual try-on) on twelve T2I-IP-DMs, and report substantial NSFW/nudity rates even under modest perturbations. They analyze existing defenses (prompt and output filters, concept erasing) and show their inherent limitations, then explore combining the IP-Adapter with an adversarially trained model (FARE) to improve robustness, with promising results. The findings underscore the need for more robust evaluation and defense strategies against image-prompt-driven jailbreaking in deployed IGS platforms, especially as IP-Adapters become more widespread.
Abstract
Recently, the Image Prompt Adapter (IP-Adapter) has been increasingly integrated into text-to-image diffusion models (T2I-DMs) to improve controllability. However, in this paper, we reveal that T2I-DMs equipped with the IP-Adapter (T2I-IP-DMs) enable a new jailbreak attack, which we name the hijacking attack. We demonstrate that, by uploading imperceptible image-space adversarial examples (AEs), an adversary can hijack large numbers of benign users into jailbreaking an Image Generation Service (IGS) driven by T2I-IP-DMs and mislead the public into discrediting the service provider. Worse still, the IP-Adapter's dependency on open-source image encoders reduces the knowledge required to craft AEs. Extensive experiments verify the technical feasibility of the hijacking attack. In light of the revealed threat, we investigate several existing defenses and explore combining the IP-Adapter with adversarially trained models to overcome existing defenses' limitations. Our code is available at https://github.com/fhdnskfbeuv/attackIPA.
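To make the idea of an encoder-only, image-space adversarial example concrete, the following is a minimal sketch rather than the authors' implementation (their code lives in the repository linked above). It assumes a toy random linear map in place of a real image encoder, an L-infinity perturbation budget of 8/255, and a squared-distance objective between embeddings. The names `encode`, `pgd_encoder_only`, and all hyperparameters are hypothetical and chosen only for illustration. It shows projected sign-gradient descent that nudges a benign input's embedding toward a target embedding while keeping every input value within the budget.

```python
import random


def encode(W, x):
    """Toy linear 'image encoder' (an assumption): embedding = W @ x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def sq_dist(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def pgd_encoder_only(W, x, target_emb, eps=8 / 255, step=1 / 255, iters=100):
    """Projected sign-gradient descent on ||encode(x_adv) - target_emb||^2.

    Only the encoder is used; no downstream generator is involved.
    Returns the lowest-loss iterate seen, always within eps of x (L-inf).
    """
    x_adv = list(x)
    best, best_loss = list(x), sq_dist(encode(W, x), target_emb)
    for _ in range(iters):
        resid = [e - t for e, t in zip(encode(W, x_adv), target_emb)]
        # Gradient of ||W x - t||^2 with respect to x is 2 * W^T * resid.
        grad = [
            2 * sum(W[i][j] * resid[i] for i in range(len(W)))
            for j in range(len(x))
        ]
        for j, g in enumerate(grad):
            sign = (g > 0) - (g < 0)
            v = x_adv[j] - step * sign
            v = min(max(v, x[j] - eps), x[j] + eps)  # project onto the L-inf ball
            x_adv[j] = min(max(v, 0.0), 1.0)         # keep a valid pixel range
        loss = sq_dist(encode(W, x_adv), target_emb)
        if loss < best_loss:
            best, best_loss = list(x_adv), loss
    return best


rng = random.Random(0)
dim_in, dim_emb = 16, 4
W = [[rng.gauss(0, 1) for _ in range(dim_in)] for _ in range(dim_emb)]
x_benign = [rng.random() for _ in range(dim_in)]   # stand-in for a benign image
x_target = [rng.random() for _ in range(dim_in)]   # stand-in for a target image
target_emb = encode(W, x_target)

eps = 8 / 255
x_adv = pgd_encoder_only(W, x_benign, target_emb, eps=eps)

before = sq_dist(encode(W, x_benign), target_emb)
after = sq_dist(encode(W, x_adv), target_emb)
max_change = max(abs(a - b) for a, b in zip(x_adv, x_benign))
print(f"embedding distance: {before:.4f} -> {after:.4f}, max change: {max_change:.4f}")
```

The same loop structure is what an adversarially trained encoder is meant to blunt: if small input changes can no longer move the embedding far, the distance reduction achievable within the budget shrinks accordingly.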
