Mind the Trojan Horse: Image Prompt Adapter Enabling Scalable and Deceptive Jailbreaking

Junxi Chen, Junhao Dong, Xiaohua Xie

TL;DR

This work reveals a security vulnerability in Image Prompt Adapter (IP-Adapter) enabled diffusion models, showing that imperceptible image-space adversarial examples can hijack benign users to trigger NSFW content, potentially misleading the public and harming service providers. The authors formalize Attacking Encoder Only (AEO), evaluate it across three tasks (text-to-image, image inpainting, and virtual try-on) on twelve T2I-IP-DMs, and demonstrate substantial NSFW/nudity rates even under modest perturbations. They analyze defenses (prompt/output filters, concept erasing) and show inherent limitations, then propose adversarial training (Fare) to align embeddings with benign prompts and improve robustness, with promising results. The findings underscore a pressing need for more robust evaluation and defense strategies against image-prompt-driven jailbreaking in deployed IGS platforms, especially as IP-Adapters become more widespread.

Abstract

Recently, the Image Prompt Adapter (IP-Adapter) has been increasingly integrated into text-to-image diffusion models (T2I-DMs) to improve controllability. However, in this paper, we reveal that T2I-DMs equipped with the IP-Adapter (T2I-IP-DMs) enable a new jailbreak attack, which we name the hijacking attack. We demonstrate that, by uploading imperceptible image-space adversarial examples (AEs), the adversary can hijack large numbers of benign users to jailbreak an Image Generation Service (IGS) driven by T2I-IP-DMs and mislead the public into discrediting the service provider. Worse still, the IP-Adapter's dependency on open-source image encoders reduces the knowledge required to craft AEs. Extensive experiments verify the technical feasibility of the hijacking attack. In light of the revealed threat, we investigate several existing defenses and explore combining the IP-Adapter with adversarially trained models to overcome the limitations of existing defenses. Our code is available at https://github.com/fhdnskfbeuv/attackIPA.
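
To make the notion of an imperceptible image-space AE against an image encoder more concrete, here is a minimal sketch of a generic L-infinity projected signed-gradient attack that pulls an input's embedding toward a target embedding. It is not the authors' implementation: the toy linear encoder `W`, the helper names (`encode`, `sq_dist`, `pgd_linf`), and the values of `eps`, `step`, and `iters` are all assumptions made for illustration, and a real T2I-IP-DM would use a deep open-source image encoder rather than a 2x3 matrix.

```python
def sign(g):
    """Sign of a scalar: -1, 0, or 1."""
    return (g > 0) - (g < 0)

def encode(W, x):
    """Toy linear 'image encoder' (assumption): embedding = W @ x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def sq_dist(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def pgd_linf(W, x, target_emb, eps, step, iters):
    """Minimize ||W(x + delta) - target_emb||^2 subject to |delta_i| <= eps
    and x + delta in [0, 1]; returns the best iterate found and its delta."""
    # Per-pixel bounds that enforce both the eps budget and the pixel range.
    lo = [max(-eps, -xi) for xi in x]
    hi = [min(eps, 1.0 - xi) for xi in x]
    delta = [0.0] * len(x)
    best_delta, best_loss = list(delta), sq_dist(encode(W, x), target_emb)
    for _ in range(iters):
        x_adv = [xi + di for xi, di in zip(x, delta)]
        resid = [e - t for e, t in zip(encode(W, x_adv), target_emb)]
        # Gradient of the squared distance w.r.t. the input: 2 * W^T resid.
        grad = [2.0 * sum(W[r][c] * resid[r] for r in range(len(W)))
                for c in range(len(x))]
        # Signed step, then projection back into the allowed box.
        delta = [max(l, min(h, d - step * sign(g)))
                 for d, g, l, h in zip(delta, grad, lo, hi)]
        loss = sq_dist(encode(W, [xi + di for xi, di in zip(x, delta)]), target_emb)
        if loss < best_loss:
            best_delta, best_loss = list(delta), loss
    return [xi + di for xi, di in zip(x, best_delta)], best_delta

# Hypothetical 3-pixel "images" and a 2-dimensional embedding space.
W = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]
x = [0.5, 0.5, 0.5]                           # benign-looking input
target_emb = encode(W, [0.9, 0.1, 0.7])       # embedding the attacker aims for
eps = 0.05                                    # perturbation budget (assumption)
x_adv, delta = pgd_linf(W, x, target_emb, eps=eps, step=0.01, iters=20)
print("distance before:", sq_dist(encode(W, x), target_emb))
print("distance after: ", sq_dist(encode(W, x_adv), target_emb))
```

The sketch only needs gradients of the encoder, which is consistent with the abstract's point that relying on open-source image encoders lowers the knowledge an adversary needs; the same bounded-perturbation setup is also what adversarially trained encoders are meant to resist.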

Paper Structure

This paper contains 84 sections, 7 equations, 19 figures, 26 tables.

Figures (19)

  • Figure 1: An illustration of jailbreaking the T2I-IP-DM. The T2I-IP-DM enables the adversary to use the image as an attack vector.
  • Figure 2: The main idea of the hijacking attack: Previous works mostly focused on the scenario where the adversary directly queries the IGS driven by a T2I-DM with perceptible adversarial texts to trigger NSFW outputs. Our work demonstrates that, by uploading AEs to the web ②, the adversary can hijack benign users and indirectly have a significant impact on the service provider who deploys an IGS driven by a T2I-IP-DM ①. In real scenarios, benign users often search for prompts online ③ to assist image generation. Due to the stealthiness of AEs, many benign users may unintentionally download AEs ④, query the IGS with AEs ⑤, and trigger NSFW output ⑥. Since benign users are unaware of AEs, they may complain that the service provider deploys an IGS with a strong bias toward NSFW concepts ⑦.
  • Figure 3: Qualitative results of the text-to-image task. From left to right are the corresponding images of SD-v1-5-Global, SD-v1-5-Plus, SDXL-Global, SDXL-Plus, and Kolors-Plus. The weight factor is 0.5. Sexual content is blacked out.
  • Figure 4: Qualitative results of the image inpainting task. From left to right are images generated by SD-v1-5-Plus, SDXL-Plus, Kolors-Plus, SD-v1-5-PlusID, SDXL-PlusID, and Kolors-PlusID.
  • Figure 5: Qualitative results of virtual try-on. Identity and sexual content are blacked out.
  • ...and 14 more figures