Towards Resilient Safety-driven Unlearning for Diffusion Models against Downstream Fine-tuning
Boheng Li, Renjie Gu, Junjie Wang, Leyi Qi, Yiming Li, Run Wang, Zhan Qin, Tianwei Zhang
TL;DR
This work tackles the brittleness of safety-driven unlearning in text-to-image diffusion models under downstream fine-tuning. It introduces ResAlign, which uses a Moreau envelope-based proximal objective to anticipate fine-tuning effects and an accompanying meta-learning scheme to generalize across diverse fine-tuning configurations. A principled implicit gradient computation via Richardson iterations enables efficient optimization, while theoretical insights link the resilience term to flatter loss landscapes. Extensive experiments across datasets, personalization methods, and model variants show that ResAlign consistently preserves safety after fine-tuning and maintains benign generation quality, outperforming prior approaches. The framework offers a practical, model-agnostic path toward more reliable safety in personalized diffusion models and informs future robustness research.
Abstract
Text-to-image (T2I) diffusion models have achieved impressive image generation quality and are increasingly fine-tuned for personalized applications. However, these models often inherit unsafe behaviors from toxic pretraining data, raising growing safety concerns. While recent safety-driven unlearning methods have made promising progress in suppressing model toxicity, they are fragile to downstream fine-tuning: we show that state-of-the-art methods largely fail to retain their effectiveness even when fine-tuned on entirely benign datasets. To mitigate this problem, we propose ResAlign, a safety-driven unlearning framework with enhanced resilience against downstream fine-tuning. By modeling downstream fine-tuning as an implicit optimization problem with a Moreau envelope-based reformulation, ResAlign enables efficient gradient estimation to minimize the recovery of harmful behaviors. Additionally, a meta-learning strategy simulates a diverse distribution of fine-tuning scenarios to improve generalization. Extensive experiments across a wide range of datasets, fine-tuning methods, and configurations demonstrate that ResAlign consistently outperforms prior unlearning approaches in retaining safety, while effectively preserving benign generation capability. Our code and pretrained models are publicly available at https://github.com/AntigoneRandy/ResAlign.
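
To make the implicit gradient step mentioned in the TL;DR more concrete, the following is a minimal sketch on a toy problem, not the authors' implementation. It assumes that fine-tuning from parameters θ is modeled as the proximal point θ* = argmin_θ' f(θ') + (1/(2λ))‖θ' − θ‖². Under that assumption, stationarity gives θ* + λ∇f(θ*) = θ, so differentiating through θ* requires solving a linear system (I + λH) v = g, where H is the Hessian of f at θ* and g is a downstream gradient. Richardson iteration solves this using only matrix-vector products, which is what would make it attractive when H is too large to form. The names `richardson_solve`, `lam`, `H`, and `g` are hypothetical, and the small random matrix stands in for a real Hessian.

```python
import numpy as np

def richardson_solve(matvec, g, omega=1.0, num_iters=200):
    """Approximately solve M v = g using only products v -> M v.

    Richardson iteration: v <- v + omega * (g - M v).
    It converges when the spectral radius of (I - omega * M) is below 1.
    """
    v = np.zeros_like(g)
    for _ in range(num_iters):
        v = v + omega * (g - matvec(v))  # step along the current residual
    return v

# Toy setup (all values are illustrative assumptions).
rng = np.random.default_rng(0)
dim = 5
Q = rng.standard_normal((dim, dim))
H = Q @ Q.T / dim                      # symmetric PSD stand-in for a Hessian
lam = 0.5 / np.linalg.norm(H, 2)       # chosen so that lam * ||H||_2 = 0.5 < 1
g = rng.standard_normal(dim)           # stand-in for a downstream loss gradient

# M = I + lam * H, applied without ever forming M explicitly.
def matvec(v):
    return v + lam * (H @ v)

v_iter = richardson_solve(matvec, g)
v_exact = np.linalg.solve(np.eye(dim) + lam * H, g)
max_abs_error = float(np.max(np.abs(v_iter - v_exact)))
```

With `omega=1.0` the update reduces to v ← g − λHv, a truncated Neumann series, which is why the sketch scales `lam` so that λ‖H‖ stays below 1. In a real diffusion model the product `H @ v` would be replaced by a Hessian-vector product computed with automatic differentiation.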
