BadBlocks: Lightweight and Stealthy Backdoor Threat in Text-to-Image Diffusion Models

Yu Pan, Jiahao Chen, Wenjie Wang, Bingrong Dai, Junjun Yang

TL;DR

BadBlocks exposes a security risk in text-to-image diffusion models: a covert, low-cost backdoor attack that trains only a small subset of UNet upsampling blocks. The method minimizes parameter updates and resource use while preserving benign generation quality, and it remains effective against attention-based defenses by exploiting block-level vulnerabilities and assimilation dynamics. Through targeted ablations and cross-model evaluations, the authors show that a few critical components (ResNet layers, Transformer blocks, and normalization layers) suffice for backdoor expression, with substantial gains in training efficiency (e.g., memory and time) and minimal degradation in perceptual quality. The work underscores the need for defenses that generalize across attack spaces and layer-level vulnerabilities. It also demonstrates BadBlocks’ effectiveness across multiple schedulers and diffusion models, highlighting practical risks for consumer-grade hardware and for platforms that host pre-trained diffusion models.

Abstract

Diffusion models have recently achieved remarkable success in image generation, yet growing evidence shows their vulnerability to backdoor attacks, where adversaries implant covert triggers to manipulate outputs. While existing defenses can detect many such attacks via visual inspection and neural network-based analysis, we identify a more lightweight and stealthy threat, termed BadBlocks. BadBlocks selectively contaminates specific blocks within the UNet architecture while preserving the normal behavior of the remaining components. Compared with prior methods, it requires only about 30% of the computation and 20% of the GPU time, yet achieves high attack success rates with minimal perceptual degradation. Extensive experiments demonstrate that BadBlocks can effectively evade state-of-the-art defenses, particularly attention-based detection frameworks. Ablation studies further reveal that effective backdoor injection does not require fine-tuning the entire network and highlight the critical role of certain layers in backdoor mapping. Overall, BadBlocks substantially lowers the barrier for backdooring large-scale diffusion models, even on consumer-grade GPUs.
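
To make the idea of block-level selection concrete, the sketch below shows how a set of parameters can be partitioned by name into one chosen block and a frozen remainder, and how the trainable share is computed. It is a minimal illustration, not the authors' code: the parameter names only mimic the `down_blocks` / `mid_block` / `up_blocks` naming seen in common UNet implementations, the sizes are made up, and the helpers `select_trainable` and `trainable_fraction` are hypothetical.

```python
from typing import Dict, List


def select_trainable(param_sizes: Dict[str, int], block_prefix: str) -> List[str]:
    """Return the names of parameters that live inside the chosen block.

    Every other parameter is treated as frozen. The trailing dot keeps
    "up_blocks.3" from also matching a name such as "up_blocks.30".
    """
    return [name for name in param_sizes if name.startswith(block_prefix + ".")]


def trainable_fraction(param_sizes: Dict[str, int], trainable: List[str]) -> float:
    """Share of all parameter elements that belong to the trainable set."""
    total = sum(param_sizes.values())
    return sum(param_sizes[name] for name in trainable) / total


# Hypothetical parameter names and element counts (illustrative only).
toy_params = {
    "down_blocks.0.resnets.0.conv1.weight": 1000,
    "mid_block.resnets.0.conv1.weight": 2000,
    "up_blocks.2.resnets.0.conv1.weight": 1500,
    "up_blocks.3.resnets.0.norm1.weight": 100,
    "up_blocks.3.resnets.0.conv1.weight": 900,
    "up_blocks.3.attentions.0.proj_in.weight": 500,
}

chosen = select_trainable(toy_params, "up_blocks.3")
share = trainable_fraction(toy_params, chosen)
print(chosen)
print(f"trainable share: {share:.2%}")
```

In a real training framework, only the selected names would keep gradients enabled. The computation and GPU-time figures quoted in the abstract are the authors' measurements and are not reproduced by this toy example.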

Paper Structure

This paper contains 26 sections, 15 equations, 11 figures, 5 tables, 2 algorithms.

Figures (11)

  • Figure 1: We propose BadBlocks, a novel backdoor attack method that backdoors the UNet by training only specific sampling blocks rather than fine-tuning the entire network. By significantly reducing the number of trainable parameters, BadBlocks lowers the computational and GPU resource requirements for the attacker. Moreover, since the remaining blocks are kept entirely frozen, the overall model performance is preserved, resulting in negligible FID degradation.
  • Figure 2: In infected UNet models, the normalization layers exhibit the most significant weight changes, followed by the ResNet layers and Transformer blocks. We hypothesize that this hierarchy of changes is critical for establishing effective backdoor mappings.
  • Figure 3: Our findings indicate that the final upsampling block plays a critical role in enabling backdoor mapping, while also demonstrating that not all model parameters are essential for executing backdoor attacks.
  • Figure 4: One key advantage of BadBlocks is that it does not rely on the modification of the loss function, enabling broad compatibility with existing backdoor methods. It consistently produces high-quality images across various triggers with minimal degradation.
  • Figure 5: BadBlocks maintains or improves generation quality (lower FID) after backdoor injection, with minimal usability loss compared to other UNet-based attacks. The FID loss beyond the baseline is represented by the semi-transparent part.
  • ...and 6 more figures
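
The observation in Figure 2, that weight changes concentrate in particular layer types, suggests a simple audit: compare a suspect checkpoint against a trusted reference and average the relative weight change per layer type. The sketch below is one way this might look, under stated assumptions: layer types are guessed from substrings of hypothetical parameter names, weights are plain lists of floats, and the helpers `layer_kind`, `relative_change`, and `change_by_kind` are illustrative names, not part of the paper or of any library.

```python
import math
from collections import defaultdict
from typing import Dict, List


def layer_kind(name: str) -> str:
    """Guess a layer type from the parameter name (assumed naming scheme).

    "norm" is checked first, so a normalization layer nested inside a
    ResNet block is counted as normalization, not as ResNet.
    """
    if "norm" in name:
        return "normalization"
    if "attentions" in name:
        return "transformer"
    if "resnets" in name:
        return "resnet"
    return "other"


def relative_change(ref: List[float], sus: List[float]) -> float:
    """L2 norm of the difference divided by the L2 norm of the reference."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(ref, sus)))
    base = math.sqrt(sum(a * a for a in ref))
    return diff / base if base > 0 else 0.0


def change_by_kind(ref_w: Dict[str, List[float]],
                   sus_w: Dict[str, List[float]]) -> Dict[str, float]:
    """Mean relative change per layer type across all shared parameters."""
    buckets = defaultdict(list)
    for name, ref in ref_w.items():
        buckets[layer_kind(name)].append(relative_change(ref, sus_w[name]))
    return {kind: sum(vals) / len(vals) for kind, vals in buckets.items()}


# Hypothetical reference and suspect weights (illustrative values only).
reference = {
    "up_blocks.3.resnets.0.norm1.weight": [1.0, 1.0],
    "up_blocks.3.resnets.0.conv1.weight": [3.0, 4.0],
    "down_blocks.0.resnets.0.conv1.weight": [1.0, 2.0],
}
suspect = {
    "up_blocks.3.resnets.0.norm1.weight": [1.0, 2.0],
    "up_blocks.3.resnets.0.conv1.weight": [3.0, 4.5],
    "down_blocks.0.resnets.0.conv1.weight": [1.0, 2.0],
}

report = change_by_kind(reference, suspect)
for kind, value in sorted(report.items(), key=lambda kv: -kv[1]):
    print(f"{kind}: {value:.4f}")
```

A large change confined to a few layer types in one block is only a hint that warrants closer inspection. Benign fine-tuning can produce similar patterns, so this diagnostic is not a detector on its own.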