Directly Denoising Diffusion Models

Dan Zhang, Jingjing Wang, Feng Luo

TL;DR

DDDM is a streamlined diffusion-based approach that generates high-quality images with few-step sampling while retaining the option of multi-step refinement. The diffusion model is conditioned on a target estimated in the previous training iteration, and the inferred $\mathbf{x}_0$ is refined iteratively via a neural PF-ODE predictor, which removes the need for bespoke samplers or teacher-student distillation. The proposed Pseudo-LPIPS metric improves robustness and perceptual alignment. Experiments on CIFAR-10 and ImageNet 64×64 show competitive FID and IS scores across one-step, two-step, and 1000-step sampling regimes. The work also highlights a practical, memory-aware training paradigm and points to future directions in continuous-time diffusion and unbiased evaluation. Overall, DDDM demonstrates that a simple, iterative conditioning strategy can approach state-of-the-art performance with a much simpler sampling pipeline.

Abstract

In this paper, we present the Directly Denoising Diffusion Model (DDDM): a simple and generic approach for generating realistic images with few-step sampling, while multistep sampling is still preserved for better performance. DDDMs require neither delicately designed samplers nor distillation from pre-trained models. DDDMs train the diffusion model conditioned on an estimated target that was generated by the model itself in previous training iterations. To generate images, samples generated at the previous time step are also taken into consideration, guiding the generation process iteratively. We further propose Pseudo-LPIPS, a novel metric loss that is more robust across hyperparameter values. Despite its simplicity, the proposed approach achieves strong performance on benchmark datasets. Our model achieves FID scores of 2.57 and 2.33 on CIFAR-10 in one-step and two-step sampling respectively, surpassing those obtained from GANs and distillation-based models. By extending the sampling to 1000 steps, we further reduce the FID score to 1.79, aligning with state-of-the-art methods in the literature. For ImageNet 64x64, our approach stands as a competitive contender against leading models.
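The training scheme described above, in which each iteration is conditioned on the estimate produced by the previous one, can be illustrated with a toy sketch. Everything here is an assumption made for illustration and not the paper's setup: the data are one-dimensional, the "timestep" is simply a noise scale, and a closed-form linear least-squares fit (the hypothetical `train_with_estimate_buffer`) stands in for training a neural network. Only the bookkeeping is the point: a per-sample buffer holds the previous estimate, is fed back in as an input, and is overwritten with the new prediction.

```python
import numpy as np

def train_with_estimate_buffer(x0, epochs=5, seed=0):
    """Toy loop: refit each epoch, feeding last epoch's estimate back as an input."""
    rng = np.random.default_rng(seed)
    n = x0.shape[0]
    x0_est = np.zeros(n)  # buffer holding the previous estimate for every sample
    mse_per_epoch = []
    for _ in range(epochs):
        t = rng.uniform(0.1, 1.0, size=n)      # toy "timestep", used as noise scale
        x_t = x0 + t * rng.standard_normal(n)  # noisy version of the data
        # Inputs: previous estimate, noisy data, timestep, and a bias term.
        feats = np.stack([x0_est, x_t, t, np.ones(n)], axis=1)
        # Stand-in for network training: closed-form linear fit to the clean data.
        w, *_ = np.linalg.lstsq(feats, x0, rcond=None)
        x0_est = feats @ w                     # new estimate, stored for next epoch
        mse_per_epoch.append(float(np.mean((x0_est - x0) ** 2)))
    return x0_est, mse_per_epoch

# Toy bimodal "dataset" (an assumption; any 1-D array works).
data_rng = np.random.default_rng(1)
x0 = np.where(data_rng.random(512) < 0.5, -2.0, 2.0) + 0.1 * data_rng.standard_normal(512)
x0_est, mse = train_with_estimate_buffer(x0)
print(mse)
```

Because the previous estimate is one of the inputs, the refit can always reproduce it, so the error in this toy never increases from one epoch to the next. That guarantee belongs to the least-squares stand-in and says nothing about the behavior of the neural model.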

Paper Structure

This paper contains 17 sections, 2 theorems, 28 equations, 17 figures, 3 tables, and 2 algorithms.

Key Result

Theorem 3.2

If the loss function $\mathcal{L}_{\text{DDDM}}^{(n)}\left(\boldsymbol{\theta}\right)\rightarrow 0$ as $n\rightarrow \infty$, it can be shown that as $n\rightarrow \infty$,

Figures (17)

  • Figure 1: An illustration of DDDM. At training epoch $n$, the model takes the noisy data $\mathbf{x}_t$, the timestep $t$, and the estimated target from the previous epoch $\mathbf{x}_0^{(n-1)}$ as inputs and predicts a new approximation $\mathbf{x}_0^{(n)}$, which is used in the next training epoch. Through this iterative process, the approximation moves gradually toward the real data $\mathbf{x}_0$.
  • Figure 2: Ablation analysis for the proposed Pseudo-LPIPS metric. (a) While LPIPS and Pseudo-Huber perform similarly, Pseudo-LPIPS further reduces FID to under 5. (b) Pseudo-LPIPS outperforms LPIPS across various values of the hyperparameter $c$, with $c=0.000069$ performing best. The y-axis of both panels is scaled logarithmically for better visualization.
  • Figure 3: One-step and two-step samples from the DDDM-deep model trained on ImageNet 64x64.
  • Figure 4: One-step and two-step samples from the DDDM-deep model trained on CIFAR-10.
  • Figure 5: FID with respect to the number of inference iterations.
  • ...and 12 more figures
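
The Figure 2 caption compares against the Pseudo-Huber metric and refers to a hyperparameter $c$. As a point of reference, the standard Pseudo-Huber distance is $\sqrt{\lVert \mathbf{x}-\mathbf{y}\rVert_2^2 + c^2} - c$, and the minimal sketch below shows how $c$ controls the switch between quadratic and linear behavior. The function name and test values are illustrative assumptions. The definition of Pseudo-LPIPS is not given in this summary, so the sketch covers Pseudo-Huber only.

```python
import numpy as np

def pseudo_huber(x, y, c):
    """Standard Pseudo-Huber distance: sqrt(||x - y||_2^2 + c^2) - c."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    sq_norm = np.sum(diff ** 2)
    return float(np.sqrt(sq_norm + c * c) - c)

# Residual norm much larger than c: behaves like the L2 norm minus c.
large = pseudo_huber([3.0, 4.0], [0.0, 0.0], c=1e-3)
# Residual norm much smaller than c: behaves like ||x - y||^2 / (2c).
small = pseudo_huber([1e-3], [0.0], c=1.0)
print(large, small)
```

A small $c$ makes the distance close to a plain L2 norm over most residuals, while a larger $c$ widens the region in which it behaves quadratically.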

Theorems & Definitions (5)

  • Definition 3.1
  • Theorem 3.2
  • proof
  • Theorem 3.3
  • proof