DiffRatio: Training One-Step Diffusion Models Without Teacher Supervision

Wenlin Chen, Mingtian Zhang, Jiajun He, Zijing Ou, José Miguel Hernández-Lobato, Bernhard Schölkopf, David Barber

TL;DR

DiffRatio targets the gradient bias of teacher-supervised one-step diffusion distillation. Instead of estimating the teacher and student scores separately and subtracting them, it trains a single time-conditional classifier to estimate the log density ratio between the one-step model and the data distribution at each diffusion time, and uses the gradient of that ratio as the training signal. The resulting estimator is consistent, has lower bias, and needs only a lightweight auxiliary network. DiffRatio achieves competitive one-step generation on CIFAR-10 and ImageNet (64×64 and 512×512) without teacher supervision, outperforming most teacher-supervised distillation methods. The approach also broadens the set of divergences that can be trained in diffusion-based generation and offers practical efficiency benefits for high-resolution synthesis.

Abstract

Score-based distillation methods (e.g., variational score distillation) train one-step diffusion models by first pre-training a teacher score model and then distilling it into a one-step student model. However, the gradient estimator in the distillation stage usually suffers from two sources of bias: (1) biased teacher supervision due to score estimation error incurred during pre-training, and (2) the student model's score estimation error during distillation. These biases can degrade the quality of the resulting one-step diffusion model. To address this, we propose DiffRatio, a new framework for training one-step diffusion models: instead of estimating the teacher and student scores independently and then taking their difference, we directly estimate the score difference as the gradient of a learned log density ratio between the student and data distributions across diffusion time steps. This approach greatly simplifies the training pipeline, significantly reduces gradient estimation bias, and improves one-step generation quality. It also reduces the auxiliary network size by using a lightweight density-ratio network instead of two full score networks, which improves computational and memory efficiency. DiffRatio achieves competitive one-step generation results on CIFAR-10 and ImageNet (64×64 and 512×512), outperforming most teacher-supervised distillation approaches.
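
The link between a score difference and a classifier can be written out with a standard identity. The block below is a sketch rather than the paper's derivation: $p_d$, $q_\theta$ and $x_t$ follow the abstract's description, while the Bayes-optimal logit $\ell^\ast$ and the class priors $\pi_1$ (data) and $\pi_0$ (student) are illustrative symbols introduced here.

```latex
\begin{aligned}
\nabla_{x_t}\log q_\theta(x_t) - \nabla_{x_t}\log p_d(x_t)
  &= \nabla_{x_t}\log\frac{q_\theta(x_t)}{p_d(x_t)}, \\
\ell^\ast(x_t,t)
  &= \log\frac{p_d(x_t)}{q_\theta(x_t)} + \log\frac{\pi_1}{\pi_0}, \\
\nabla_{x_t}\log\frac{q_\theta(x_t)}{p_d(x_t)}
  &= -\nabla_{x_t}\,\ell^\ast(x_t,t).
\end{aligned}
```

The prior term $\log(\pi_1/\pi_0)$ does not depend on $x_t$, so it vanishes under the gradient. This is why a single classifier separating data samples from student samples is enough to recover the score difference, assuming the classifier is close to Bayes-optimal.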

Paper Structure

This paper contains 26 sections, 1 theorem, 50 equations, 8 figures, 5 tables.

Key Result

Theorem 3.1

Fix any noise level $t>0$. Let $p_1(x_t)=p_d(x_t)$ and $p_0(x_t)=q_\theta(x_t)$, and let $c_\eta(x_t,t)$ be a well-specified time-conditional logistic classifier trained by maximum likelihood to distinguish samples from $p_1$ versus $p_0$ with nonzero class priors. Under standard regularity conditions, ... converges (in probability) to the true $\log \frac{q_\theta(x_t)}{p_d(x_t)}$. Consequently, $\nabla$ ...
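
The consistency claim can be illustrated numerically on a toy problem. The following is a minimal sketch, not the paper's implementation: it assumes 1D Gaussian stand-ins for $p_d(x_t)$ and $q_\theta(x_t)$ at one fixed noise level (all means and variances are made up), equal class priors, and a logistic classifier with quadratic features, which is well-specified for this case because the log ratio of two Gaussians is quadratic in $x_t$. The helper names `features`, `sigmoid` and `fit_logistic` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed 1D stand-ins for the noised distributions at one fixed noise level t.
m_d, v_d = 1.0, 1.0     # p_d(x_t): mean and variance (illustrative values)
m_q, v_q = -0.5, 2.25   # q_theta(x_t): mean and variance (illustrative values)

n = 50_000  # samples per class, so the class priors are equal
x_data = rng.normal(m_d, np.sqrt(v_d), n)      # class 1: data
x_student = rng.normal(m_q, np.sqrt(v_q), n)   # class 0: student
x = np.concatenate([x_data, x_student])
y = np.concatenate([np.ones(n), np.zeros(n)])


def features(x):
    """Quadratic features [1, x, x^2]; well-specified for a Gaussian log ratio."""
    return np.stack([np.ones_like(x), x, x**2], axis=1)


def sigmoid(z):
    """Logistic function written with tanh so it cannot overflow."""
    return 0.5 * (1.0 + np.tanh(0.5 * z))


def fit_logistic(phi, y, steps=30, ridge=1e-8):
    """Maximum-likelihood logistic regression via Newton's method.

    The tiny ridge term only stabilises the linear solve numerically.
    """
    w = np.zeros(phi.shape[1])
    for _ in range(steps):
        p = sigmoid(phi @ w)
        grad = phi.T @ (p - y)
        hess = (phi * (p * (1.0 - p))[:, None]).T @ phi
        w = w - np.linalg.solve(hess + ridge * np.eye(w.size), grad)
    return w


# Logit(x) = w0 + w1*x + w2*x^2 estimates log p_d(x)/q_theta(x) (equal priors).
w = fit_logistic(features(x), y)

# Score difference = d/dx log q_theta - d/dx log p_d = -d/dx logit.
grid = np.linspace(-2.0, 2.0, 41)
est_score_diff = -(w[1] + 2.0 * w[2] * grid)
true_score_diff = -(grid - m_q) / v_q + (grid - m_d) / v_d

max_err = float(np.max(np.abs(est_score_diff - true_score_diff)))
print("max abs error of the estimated score difference:", max_err)
```

In this sketch the score difference is read off from one fitted classifier, with no separate score model for either distribution. A neural density-ratio network conditioned on $t$ would replace the quadratic logit in a realistic setting; that substitution is beyond what this toy example demonstrates.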

Figures (8)

  • Figure 1: Images generated by DiffRatio on ImageNet $512{\times}512$ with a single step (FID=1.41).
  • Figure 2: Score difference estimation accuracy on a 2D mixture of Gaussians problem. Our density-ratio-based method achieves lower L2 error (left) and higher cosine similarity (right) with the ground-truth score difference than VSD, which suffers from accumulated errors in separate teacher and student score estimation.
  • Figure 3: Images generated by DiffRatio-DiJS on CIFAR-10 and ImageNet $64{\times}64$. Each image is generated with a single NFE.
  • Figure 4: Images generated by DiffRatio-DiJS-M on ImageNet $512{\times}512$ (FID=1.41). Each image is generated with 1 step (1 NFE).
  • Figure 5: Visualization of different initializations and collapsed samples on CIFAR-10.
  • ...and 3 more figures

Theorems & Definitions (1)

  • Theorem 3.1: Consistency of the diffusive density-ratio estimator