Security Risk of Misalignment between Text and Image in Multi-modal Model

Xiaosen Wang, Zhijin Ge, Shaokang Wang

TL;DR

This work reveals a misalignment between text and image conditioning in multi-modal diffusion models, showing that an adversary can steer generated content by modifying the input image while keeping the prompt fixed. It introduces the Prompt-Restricted Multi-modal Attack (PReMA), which optimizes an adversarial image to realize target content under a given prompt and uses a loss term to bypass NSFW safety checkers. Across inpainting and style transfer tasks and multiple models, PReMA generates NSFW content with high success rates and demonstrates transferability and safety-checker circumvention, highlighting a novel security risk in real-world editing workflows. The findings motivate stronger alignment between the image and text modalities and more robust defenses for diffusion models to mitigate abuse and ensure safer deployment in multi-modal content creation.

Abstract

Despite the notable advancements and versatility of multi-modal diffusion models, such as text-to-image models, their susceptibility to adversarial inputs remains underexplored. Contrary to expectations, our investigations reveal that the alignment between textual and image modalities in existing diffusion models is inadequate. This misalignment presents significant risks, especially in the generation of inappropriate or Not-Safe-For-Work (NSFW) content. To this end, we propose a novel attack called Prompt-Restricted Multi-modal Attack (PReMA) to manipulate the generated content by modifying the input image in conjunction with any specified prompt, without altering the prompt itself. PReMA is the first attack that manipulates model outputs solely by creating adversarial images, distinguishing itself from prior methods that primarily generate adversarial prompts to produce NSFW content. Consequently, PReMA poses a novel threat to the integrity of multi-modal diffusion models, particularly in image-editing applications that operate with fixed prompts. Comprehensive evaluations conducted on image inpainting and style transfer tasks across various models confirm the potent efficacy of PReMA.

Paper Structure

This paper contains 22 sections, 3 equations, 12 figures, 4 tables, and 1 algorithm.

Figures (12)

  • Figure 1: Illustration of the standard inference, existing attack and our proposed PReMA.
  • Figure 2: The input and generated images with the same prompt ("concept art digital painting of an elven castle, inspired by lord of the rings, highly detailed, 8k") and mask (centered, with a size of $512\times 512$) on the SDv1.5 inpainting model.
  • Figure 3: The variation of cosine distance (CosDis) between the generated image and NSFW embeddings during the optimization process of PReMA with $\mathcal{L}_{sc}$.
  • Figure 4: The generated images of PReMA on four different inpainting models. The default prompt for all adversarial images is fixed as "Transforms the color of the clothes into black, high resolution". Gaussian blur is applied.
  • Figure 4: ASR ($\%$) of PReMA across various prompts. P* indicates the white-box evaluation, P1 - P4 are used to evaluate the transferability across different prompts.
  • ...and 7 more figures