DiffRatio: Training One-Step Diffusion Models Without Teacher Supervision
Wenlin Chen, Mingtian Zhang, Jiajun He, Zijing Ou, José Miguel Hernández-Lobato, Bernhard Schölkopf, David Barber
TL;DR
DiffRatio targets the gradient bias in score-based one-step diffusion distillation by directly learning the gradient of the log density ratio between the one-step model and the data distribution across diffusion time steps. It replaces the difference of two separately estimated scores with a single density-ratio classifier, yielding a lower-bias gradient signal and a lighter training footprint. Empirically, DiffRatio achieves competitive one-step generation on CIFAR-10 and ImageNet (64×64 and 512×512) without teacher supervision, outperforming most teacher-supervised distillation methods while using a smaller auxiliary network. The approach broadens the set of trainable divergences in diffusion-based generation and offers practical efficiency benefits for high-resolution synthesis.
Abstract
Score-based distillation methods (e.g., variational score distillation) train one-step diffusion models by first pre-training a teacher score model and then distilling it into a one-step student model. However, the gradient estimator in the distillation stage usually suffers from two sources of bias: (1) biased teacher supervision due to score estimation error incurred during pre-training, and (2) the student model's score estimation error during distillation. These biases can degrade the quality of the resulting one-step diffusion model. To address them, we propose DiffRatio, a new framework for training one-step diffusion models: instead of estimating the teacher and student scores independently and then taking their difference, we directly estimate the score difference as the gradient of a learned log density ratio between the student and data distributions across diffusion time steps. This approach greatly simplifies the training pipeline, significantly reduces gradient estimation bias, and improves one-step generation quality. It also reduces the auxiliary network size by using a lightweight density-ratio network instead of two full score networks, which improves computational and memory efficiency. DiffRatio achieves competitive one-step generation results on CIFAR-10 and ImageNet (64×64 and 512×512), outperforming most teacher-supervised distillation approaches.
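
To illustrate the central identity, that a score difference equals the gradient of a log density ratio, here is a minimal toy sketch in NumPy. It is not the authors' implementation: it ignores diffusion time, uses 1D Gaussians as stand-ins for the student and data distributions, and fits a logistic classifier whose logit approximates the log ratio when both classes have equal sample sizes. The function name `fit_log_ratio` and all hyperparameters are hypothetical choices made for illustration.

```python
import numpy as np

def fit_log_ratio(x_student, x_data, steps=2000, lr=0.5):
    """Fit a linear logit w*x + b by logistic regression (hypothetical helper).

    With equal class sizes, the optimal logit approximates
    log p_student(x) - log p_data(x).
    """
    x = np.concatenate([x_student, x_data])
    # Label 1 for student samples, 0 for data samples.
    y = np.concatenate([np.ones_like(x_student), np.zeros_like(x_data)])
    w, b = 0.0, 0.0
    for _ in range(steps):
        logits = w * x + b
        probs = 1.0 / (1.0 + np.exp(-logits))
        err = probs - y  # gradient of the cross-entropy loss w.r.t. the logits
        w -= lr * np.mean(err * x)
        b -= lr * np.mean(err)
    return w, b

# Toy stand-ins (assumptions): student ~ N(1, 1), data ~ N(0, 1).
rng = np.random.default_rng(0)
x_student = rng.normal(1.0, 1.0, size=20000)
x_data = rng.normal(0.0, 1.0, size=20000)

w, b = fit_log_ratio(x_student, x_data)

# Analytically, log p_student(x) - log p_data(x) = x - 0.5, so the score
# difference d/dx of the log ratio is the constant 1. The gradient of the
# learned logit with respect to x is simply w.
print("learned slope (score difference estimate):", w)
print("learned intercept:", b)
```

In this toy setting the learned slope should land close to 1 and the intercept close to -0.5, so a single classifier recovers the score difference without estimating either score separately. In the setting described in the abstract, the ratio model would additionally depend on the diffusion time step and would be a neural network rather than a linear logit.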
