Diffusion Tuning: Transferring Diffusion Models via Chain of Forgetting

Jincheng Zhong, Xingzhuo Guo, Jiaxiang Dong, Mingsheng Long

TL;DR

This work investigates how diffusion models transfer knowledge during fine-tuning and uncovers a chain of forgetting: pre-trained knowledge transfers well at the low-noise end of the reverse process, close to the generated data, and is increasingly forgotten as the noise level $t$ grows. Building on this, it introduces Diff-Tuning, a simple two-objective method that retains general pre-trained denoising knowledge while reconsolidating high-level domain-specific patterns, guided by the chain of forgetting. The authors provide theoretical insights and demonstrate gains over standard fine-tuning across eight class-conditional generation tasks and five controllable-generation settings with ControlNet, reporting up to a 26% relative improvement in FID and 24% faster ControlNet convergence. Diff-Tuning is architecture-agnostic and complements existing parameter-efficient fine-tuning (PEFT) approaches, improving their transfer performance while mitigating catastrophic forgetting. The approach is of practical value for adapting large diffusion models to diverse downstream tasks with reduced compute and data requirements.

Abstract

Diffusion models have significantly advanced the field of generative modeling. However, training a diffusion model is computationally expensive, creating a pressing need to adapt off-the-shelf diffusion models for downstream generation tasks. Current fine-tuning methods focus on parameter-efficient transfer learning but overlook the fundamental transfer characteristics of diffusion models. In this paper, we investigate the transferability of diffusion models and observe a monotonic chain-of-forgetting trend in transferability along the reverse process. Based on this observation and novel theoretical insights, we present Diff-Tuning, a frustratingly simple transfer approach that leverages the chain-of-forgetting tendency. Diff-Tuning encourages the fine-tuned model to retain the pre-trained knowledge at the end of the denoising chain, close to the generated data, while discarding it on the noise side. We conduct comprehensive experiments to evaluate Diff-Tuning, including the transfer of pre-trained Diffusion Transformer models to eight downstream generation tasks and the adaptation of Stable Diffusion to five control conditions with ControlNet. Diff-Tuning achieves a 26% improvement over standard fine-tuning and enhances the convergence speed of ControlNet by 24%. Notably, parameter-efficient transfer learning techniques for diffusion models can also benefit from Diff-Tuning.
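
The idea of retaining pre-trained knowledge near the data end of the chain while adapting on the noise side can be pictured as a pair of denoising losses weighted by timestep. The sketch below is an illustration only and is not the authors' implementation: the linear weights, the function names (`retention_weight`, `reconsolidation_weight`, `weighted_two_term_loss`), and the assumption that per-sample squared denoising errors are already available are all hypothetical choices made for this example.

```python
import numpy as np

def retention_weight(t, T):
    """Hypothetical weight: largest at the data end of the chain (t -> 0)."""
    return 1.0 - t / T

def reconsolidation_weight(t, T):
    """Hypothetical complementary weight: largest at the noise end (t -> T)."""
    return t / T

def weighted_two_term_loss(err_retain, err_target, t, T):
    """Combine two sets of per-sample squared denoising errors.

    err_retain: errors on samples representing pre-trained knowledge to keep
    err_target: errors on downstream-domain samples to adapt to
    t:          timesteps in [0, T], one per sample
    """
    t = np.asarray(t, dtype=float)
    retain_term = np.mean(retention_weight(t, T) * err_retain)
    target_term = np.mean(reconsolidation_weight(t, T) * err_target)
    return retain_term + target_term
```

With these assumed weights, a batch drawn at small `t` is dominated by the retention term and a batch drawn at large `t` by the target-domain term, which mirrors the direction of the chain of forgetting described above.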

Paper Structure (60 sections, 1 theorem, 14 equations, 10 figures, 4 tables, 1 algorithm)

Key Result

Theorem 1

Suppose a diffusion model with $\lim\limits_{t\to 0}\alpha_t=1$ and $\lim\limits_{t\to T}\alpha_t=0$ over finite samples, then the ideal denoiser $F$ satisfies

Figures (10)

  • Figure 1: Case study of directly replacing the denoiser with the original pre-trained model on lightly disturbed data (left). The changes in Fréchet Inception Distance (FID) as the denoising steps are incrementally replaced by the original pre-trained model (right).
  • Figure 2: Conceptual illustration of the chain of forgetting (left): the forgetting tendency increases as $t$ grows. (a) Building a knowledge bank for the pre-trained model before fine-tuning. (b) Diff-Tuning leverages knowledge retention and reconsolidation via the chain of forgetting.
  • Figure 3: An example of evaluating dissimilarities between conditions (the Normal condition) to infer the occurrence of sudden convergence.
  • Figure 4: Qualitative comparison of Diff-Tuning with the standard ControlNet. Red boxes mark the occurrence of "sudden convergence".
  • Figure 5: The compatibility of Diff-Tuning with PEFT (a), and catastrophic forgetting analysis (b-c).
  • ...and 5 more figures

Theorems & Definitions (1)

  • Theorem 1: Chain of Forgetting
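
Theorem 1 concerns the ideal denoiser over finite samples in the two limits $\alpha_t \to 1$ and $\alpha_t \to 0$. As a rough numerical illustration (not the paper's code or proof), the sketch below evaluates the standard closed-form posterior mean $\mathbb{E}[x_0 \mid x_t]$ for a uniform distribution over a finite dataset. It assumes the forward process $x_t = \sqrt{\alpha_t}\,x_0 + \sqrt{1-\alpha_t}\,\epsilon$ with Gaussian $\epsilon$; the function name `ideal_denoiser` and the toy data are hypothetical.

```python
import numpy as np

def ideal_denoiser(x_t, data, alpha_t):
    """Posterior mean E[x_0 | x_t] when x_0 is uniform over the rows of `data`
    and x_t = sqrt(alpha_t) * x_0 + sqrt(1 - alpha_t) * noise (assumed form)."""
    sq_dist = np.sum((x_t - np.sqrt(alpha_t) * data) ** 2, axis=1)
    logits = -sq_dist / (2.0 * (1.0 - alpha_t))
    logits -= logits.max()          # subtract the max for a stable softmax
    weights = np.exp(logits)
    weights /= weights.sum()
    return weights @ data           # weighted average of the data points

# Toy dataset of three 2-D points and a clean sample near the second point.
data = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
x_0 = np.array([3.5, 0.2])

# Noise-free inputs are used here so that the demo is deterministic.
low_noise = ideal_denoiser(np.sqrt(0.9999) * x_0, data, alpha_t=0.9999)
high_noise = ideal_denoiser(np.sqrt(1e-6) * x_0, data, alpha_t=1e-6)

print(low_noise)    # close to the nearest data point, [4, 0]
print(high_noise)   # close to the dataset mean, [4/3, 4/3]
```

In this toy setting the denoiser snaps to the nearest data point when almost no noise is present and collapses toward the dataset mean when the input is almost pure noise, so the low-noise end carries sample-level detail and the high-noise end carries only coarse, dataset-level information.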