Embedding Hidden Adversarial Capabilities in Pre-Trained Diffusion Models

Lucas Beerens, Desmond J. Higham

TL;DR

CRAFTed-Diffusion addresses the problem of covertly embedding adversarial capabilities into pre-trained diffusion pipelines. The authors propose a fine-tuning procedure on the UNet, restricted by two projection-based safeguards (gradient projection, and parameter projection with an $\ell_2$-norm bound), so that generated images remain visually indistinguishable from those of the original model while systematically causing downstream classifiers to misclassify them, optionally into targeted classes. Key contributions include a practical, low-cost attack that preserves perceptual quality (as evidenced by stable FID and small $\ell_2$ distances) and a comprehensive evaluation across Imagenette classes, highlighting significant security risks in externally sourced generative models. The work underscores the need for model integrity verification and defense mechanisms, notes potential benign uses such as watermarking, and calls for future research into defenses and trustworthy generative systems.
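To make the two projection safeguards named in the TL;DR concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes parameters and gradients are flattened into lists of floats, and that "parameter projection" means pulling the fine-tuned weights back into an $\ell_2$ ball of radius `eps` around the original weights. The gradient step shown is one common form of gradient projection (dropping the component of an update that opposes a reference gradient); the summary does not specify the paper's exact rule, so this choice is an assumption. The function names and the `eps` parameter are hypothetical.

```python
import math


def project_to_l2_ball(theta, theta0, eps):
    """Return theta projected so that ||theta - theta0||_2 <= eps.

    theta0 plays the role of the original (pre-trained) weights;
    eps is a hypothetical bound on how far fine-tuning may drift.
    """
    delta = [t - t0 for t, t0 in zip(theta, theta0)]
    norm = math.sqrt(sum(d * d for d in delta))
    if norm <= eps:
        return list(theta)  # already inside the ball, nothing to do
    scale = eps / norm      # shrink the offset onto the ball's surface
    return [t0 + scale * d for t0, d in zip(theta0, delta)]


def remove_conflicting_component(g_adv, g_ref):
    """Assumed form of gradient projection (not taken from the paper).

    If g_adv has a negative inner product with g_ref, subtract its
    component along g_ref so the two no longer conflict.
    """
    dot = sum(a * r for a, r in zip(g_adv, g_ref))
    ref_sq = sum(r * r for r in g_ref)
    if dot >= 0 or ref_sq == 0.0:
        return list(g_adv)  # no conflict, or no reference direction
    coeff = dot / ref_sq
    return [a - coeff * r for a, r in zip(g_adv, g_ref)]
```

In a training loop, a sketch like this would apply the gradient step before each optimizer update and the parameter projection after it, so that the fine-tuned weights never leave the allowed neighbourhood of the original model.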

Abstract

We introduce a new attack paradigm that embeds hidden adversarial capabilities directly into diffusion models via fine-tuning, without altering their observable behavior or requiring modifications during inference. Unlike prior approaches that target specific images or adjust the generation process to produce adversarial outputs, our method integrates adversarial functionality into the model itself. The resulting tampered model generates high-quality images indistinguishable from those of the original, yet these images cause misclassification in downstream classifiers at a high rate. The misclassification can be targeted to specific output classes. Users can employ this compromised model unaware of its embedded adversarial nature, as it functions identically to a standard diffusion model. We demonstrate the effectiveness and stealthiness of our approach, uncovering a covert attack vector that raises new security concerns. These findings expose a risk arising from the use of externally supplied models and highlight the urgent need for robust model verification and defense mechanisms against hidden threats in generative models. The code is available at https://github.com/LucasBeerens/CRAFTed-Diffusion.

Paper Structure

This paper contains 17 sections, 4 equations, 2 figures, 3 tables, 1 algorithm.

Figures (2)

  • Figure 1: An overview of the CRAFTed-Diffusion (Covert, Restricted, Adversarially Fine-Tuned Diffusion) algorithm which embeds adversarial capabilities into pre-trained diffusion models via fine-tuning of their internal parameters.
  • Figure 2: Comparison grid illustrating the outputs of the base Stable Diffusion v2 model versus those from the adversarially fine-tuned models (CRAFTed-Diffusion) across several Imagenette classes. For each class, the left column displays an image generated by the unaltered (base) model, correctly classified by a pre-trained Inception-v3, while the right column shows the corresponding image from the fine-tuned model that has been subtly manipulated to induce misclassification.