Towards Resilient Safety-driven Unlearning for Diffusion Models against Downstream Fine-tuning

Boheng Li, Renjie Gu, Junjie Wang, Leyi Qi, Yiming Li, Run Wang, Zhan Qin, Tianwei Zhang

TL;DR

This work tackles the brittleness of safety-driven unlearning in text-to-image diffusion models under downstream fine-tuning. It introduces ResAlign, which uses a Moreau envelope-based proximal objective to anticipate the effect of fine-tuning, together with a meta-learning scheme that generalizes across diverse fine-tuning configurations. Implicit gradients are computed efficiently via Richardson iterations, and a theoretical analysis links the resilience term to flatter loss landscapes. Extensive experiments across datasets, personalization methods, and model variants show that ResAlign consistently preserves safety after fine-tuning and maintains benign generation quality, outperforming prior approaches. The framework offers a practical, model-agnostic path toward more reliable safety in personalized diffusion models and informs future robustness research.

Abstract

Text-to-image (T2I) diffusion models have achieved impressive image generation quality and are increasingly fine-tuned for personalized applications. However, these models often inherit unsafe behaviors from toxic pretraining data, raising growing safety concerns. While recent safety-driven unlearning methods have made promising progress in suppressing model toxicity, they are fragile under downstream fine-tuning: we show that state-of-the-art methods largely fail to retain their effectiveness even when fine-tuned on entirely benign datasets. To mitigate this problem, we propose ResAlign, a safety-driven unlearning framework with enhanced resilience against downstream fine-tuning. By modeling downstream fine-tuning as an implicit optimization problem with a Moreau envelope-based reformulation, ResAlign enables efficient gradient estimation to minimize the recovery of harmful behaviors. Additionally, a meta-learning strategy simulates a diverse distribution of fine-tuning scenarios to improve generalization. Extensive experiments across a wide range of datasets, fine-tuning methods, and configurations demonstrate that ResAlign consistently outperforms prior unlearning approaches in retaining safety, while effectively preserving benign generation capability. Our code and pretrained models are publicly available at https://github.com/AntigoneRandy/ResAlign.
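
To make the idea of an implicit gradient through a Moreau envelope-style fine-tuning model more concrete, the following is a minimal toy sketch and not the authors' implementation. It assumes fine-tuning can be written as a proximal problem with a hypothetical strength parameter `gamma`, uses a small fixed matrix `A` as a stand-in for the Hessian of the fine-tuning loss, and solves the resulting linear system with a Richardson iteration that needs only Hessian-vector products. All names (`richardson_solve`, `hvp`, `grad_harmful`) are illustrative.

```python
# Toy sketch (assumption): fine-tuning from theta is modeled as the proximal problem
#   theta_ft(theta) = argmin_phi  L_ft(phi) + ||phi - theta||^2 / (2 * gamma).
# Differentiating its stationarity condition gives
#   d theta_ft / d theta = (I + gamma * H)^(-1),  H = Hessian of L_ft at theta_ft,
# so the gradient of a harmful loss at theta_ft, taken w.r.t. theta, is
#   v = (I + gamma * H)^(-1) @ grad_harmful      (H is symmetric).


def matvec(matrix, vector):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def richardson_solve(hvp, b, gamma, steps=200, omega=1.0):
    """Approximately solve (I + gamma * H) v = b using only products H @ v.

    Update: v <- v + omega * (b - (I + gamma * H) v).
    With omega = 1 this converges when gamma * ||H|| < 1.
    """
    v = list(b)
    for _ in range(steps):
        hv = hvp(v)
        residual = [bi - (vi + gamma * hi) for bi, vi, hi in zip(b, v, hv)]
        v = [vi + omega * ri for vi, ri in zip(v, residual)]
    return v


# Hypothetical stand-ins: a quadratic fine-tuning loss with constant Hessian A,
# and a made-up harmful-loss gradient evaluated at the fine-tuned parameters.
A = [[2.0, 0.5], [0.5, 1.0]]
gamma = 0.3
grad_harmful = [1.0, -2.0]

implicit_grad = richardson_solve(lambda v: matvec(A, v), grad_harmful, gamma)

# Sanity check: applying (I + gamma * A) to the solution should recover grad_harmful.
reconstructed = [
    vi + gamma * hi for vi, hi in zip(implicit_grad, matvec(A, implicit_grad))
]
print(implicit_grad, reconstructed)
```

In a real diffusion model the explicit matrix would be replaced by automatic-differentiation Hessian-vector products, which is the main reason an iterative solver is attractive here: the Hessian never has to be formed or inverted.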

Paper Structure

This paper contains 33 sections, 1 theorem, 22 equations, 13 figures, 15 tables, 2 algorithms.

Key Result

Proposition 1

Let $\theta \in \mathbb{R}^d$ and $\theta_{\text{FT}}^* \in \mathbb{R}^d$ denote the parameters of the base model and the fine-tuned model, respectively. Assume the harmful loss $\mathcal{L}_{\text{harmful}}$ is twice differentiable around $\theta$, and the parameter difference of the two models $\xi \in \mathbb{R}^d$ [...], where $\text{Tr}(\nabla_\theta^2 \mathcal{L}_{\text{harmful}}(\theta)) = \sum_{i=1}^d \frac{\partial^2 \mathcal{L}_{\text{harmful}}(\theta)}{\partial \theta_i^2}$ is the trace of the Hessian of the harmful loss. [...]
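
The proposition is truncated here, so the exact assumption on $\xi$ and the exact conclusion are not recoverable from this page. As a hedged illustration of why a Hessian trace appears naturally in this kind of statement, the sketch below assumes (this is an assumption, not taken from the text) that $\xi$ is zero-mean Gaussian with covariance $\sigma^2 I$. For a quadratic loss, a second-order Taylor expansion then gives an expected loss change of $\frac{\sigma^2}{2}\text{Tr}(\nabla_\theta^2 \mathcal{L})$, which the code checks by Monte Carlo on a toy two-dimensional loss. The matrix `H`, the point `theta`, and the function `harmful_loss` are hypothetical stand-ins.

```python
import random

# Hypothetical toy "harmful loss": L(p) = 0.5 * p^T H p, whose Hessian is H.
H = [[2.0, 0.5], [0.5, 1.0]]
theta = [1.0, -1.0]
sigma = 0.1  # assumed standard deviation of each coordinate of xi


def harmful_loss(params):
    """Quadratic stand-in loss with constant Hessian H."""
    hp = [sum(h * x for h, x in zip(row, params)) for row in H]
    return 0.5 * sum(x * y for x, y in zip(params, hp))


rng = random.Random(0)  # fixed seed so the run is reproducible
n_samples = 200_000
base_loss = harmful_loss(theta)

# Monte Carlo estimate of E[L(theta + xi) - L(theta)] with xi ~ N(0, sigma^2 I).
total = 0.0
for _ in range(n_samples):
    perturbed = [t + rng.gauss(0.0, sigma) for t in theta]
    total += harmful_loss(perturbed) - base_loss
mc_increase = total / n_samples

# Second-order prediction: (sigma^2 / 2) * Tr(H); the first-order term averages out.
trace_H = sum(H[i][i] for i in range(len(H)))
predicted_increase = 0.5 * sigma**2 * trace_H

print(mc_increase, predicted_increase)
```

Under this assumed perturbation model, a smaller Hessian trace means a smaller expected rise in the loss after a random parameter shift, which is one common way to read "flatter loss landscape".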

Figures (13)

  • Figure 1: Visualization of harmful generation. Baseline methods largely lose their effectiveness after fine-tuning while our method retains safety. The black blocks are added by the authors to avoid disturbing readers.
  • Figure 2: Evaluation across different fine-tuning steps.
  • Figure 3: Visualization results on benign generation. Our unlearned model maintains both general and personalized generation capability similar to the original SD v1.4.
  • Figure 4: Evaluation on contaminated data.
  • Figure 5: Effect of $\gamma$.
  • ...and 8 more figures

Theorems & Definitions (2)

  • Proposition 1
  • proof