Safety Alignment Backfires: Preventing the Re-emergence of Suppressed Concepts in Fine-tuned Text-to-Image Diffusion Models

Sanghyun Kim, Moonseok Choi, Jinwoo Shin, Juho Lee

TL;DR

This work reveals a critical brittleness in safety alignment for fine-tuned text-to-image diffusion models: even benign fine-tuning can reactivate suppressed harmful concepts. It introduces Modular LoRA, which trains a dedicated safety module separately and merges it only during inference, preserving safety without sacrificing adaptability. The merged weights are formalized as $W^* = W_0 + \Delta W_{safe} + \Delta W_{ft}^{*}$. Empirical results across the Pokémon, Naruto, and Danbooru datasets show that Modular LoRA reduces jailbreaking signals and maintains image quality and alignment comparable to full fine-tuning baselines, offering a practical defense for real-world deployment. The approach provides a concrete, modular strategy for improving the security of personalized diffusion models while highlighting the need for further exploration of safety in downstream fine-tuning scenarios.

Abstract

Fine-tuning text-to-image diffusion models is widely used for personalization and adaptation to new domains. In this paper, we identify a critical vulnerability of fine-tuning: safety alignment methods designed to filter harmful content (e.g., nudity) can break down during fine-tuning, allowing previously suppressed content to resurface, even when using benign datasets. While this "fine-tuning jailbreaking" issue is known in large language models, it remains largely unexplored in text-to-image diffusion models. Our investigation reveals that standard fine-tuning can inadvertently undo safety measures, causing models to relearn harmful concepts that were previously removed and even exacerbate harmful behaviors. To address this issue, we present a novel but immediate solution called Modular LoRA, which involves training Safety Low-Rank Adaptation (LoRA) modules separately from Fine-Tuning LoRA components and merging them during inference. This method effectively prevents the re-learning of harmful content without compromising the model's performance on new tasks. Our experiments demonstrate that Modular LoRA outperforms traditional fine-tuning methods in maintaining safety alignment, offering a practical approach for enhancing the security of text-to-image diffusion models against potential attacks.
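
The inference-time merge described above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a single linear layer and the usual LoRA parameterization $\Delta W = BA$. The function names `lora_delta` and `merge_for_inference`, the matrix sizes, and the random stand-in weights are hypothetical and are not taken from the paper's code.

```python
import numpy as np


def lora_delta(B, A, scale=1.0):
    """Return the low-rank weight update scale * (B @ A)."""
    return scale * (B @ A)


def merge_for_inference(W0, delta_safe, delta_ft):
    """Assemble W* = W0 + delta_safe + delta_ft at inference time only."""
    return W0 + delta_safe + delta_ft


# Hypothetical sizes; real diffusion layers are far larger.
rng = np.random.default_rng(0)
d_out, d_in, rank = 8, 6, 2

# Stand-in for frozen base weights W0 (random values, not real model weights).
W0 = rng.normal(size=(d_out, d_in))

# Safety LoRA factors: assumed to be trained on their own and then kept fixed.
B_safe = rng.normal(size=(d_out, rank))
A_safe = rng.normal(size=(rank, d_in))

# Fine-tuning LoRA factors: assumed to be trained separately from the safety module.
B_ft = rng.normal(size=(d_out, rank))
A_ft = rng.normal(size=(rank, d_in))

delta_safe = lora_delta(B_safe, A_safe)
delta_ft = lora_delta(B_ft, A_ft)

# The two modules only meet here, when the inference weights are built.
W_star = merge_for_inference(W0, delta_safe, delta_ft)
```

Because the two updates are stored separately in this sketch, fine-tuning never writes into the safety factors; they are simply added back when the final weights are assembled.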

Paper Structure

This paper contains 50 sections, 3 equations, 18 figures, 8 tables.

Figures (18)

  • Figure 1: After fine-tuning FLUX.1 for 2,000 steps on the Pokémon dataset (bottom), 25% of the generated images contained signatures (highlighted in red boxes), up from only 3% before fine-tuning (top). The model not only adapted to the anime style but also readily reproduced signatures.
  • Figure 2: After fine-tuning FLUX.1 for 1,500 steps on the Pokémon dataset (bottom), images tend to exhibit more explicit content than before fine-tuning (top). Exposed body parts are masked by the authors (marked $\star$).
  • Figure 3: Impact of fine-tuning ESD on safety performance with varying numbers of training images on the Pokémon dataset. Results display the percentage of unsafe images generated over fine-tuning steps, with darker lines representing larger training sets (from 5 to 848 images). Models fine-tuned on larger datasets tend to produce more unsafe images over time, while those trained with fewer images exhibit early-stage safety degradation.
  • Figure 4: Impact of fine-tuning SDD on safety performance with varying numbers of training images on the Pokémon dataset. Results demonstrate that models trained on larger datasets (darker lines) exhibit a significant increase in the generation of unsafe images, with some reaching up to 70% unsafe content.
  • Figure 5: The percentage of unsafe images (red) and KID measuring similarity to training images (blue) during the early fine-tuning stage. The additional weights ($\Delta W_{ft}$) quickly relearn the nudity concept and then acquire the anime style in a later stage. In (b), images were generated with only $\Delta W_{ft}$ to see what it has learned.
  • ...and 13 more figures