Diffusion As Self-Distillation: End-to-End Latent Diffusion In One Model
Xiyuan Wang, Muhan Zhang
TL;DR
This work tackles the inefficiency of three-part latent diffusion pipelines by unifying the encoder, decoder, and diffusion model into a single end-to-end network. It identifies latent collapse as the key barrier to naive joint training and reframes the problem through self-distillation theory, introducing Diffusion as Self-Distillation (DSD). The core innovations are decoupling the target latent via stop-gradient and a loss transformation that recasts velocity prediction as denoising the latent, which together enable stable end-to-end training on ViT backbones. Empirical results on ImageNet 256×256 show parameter-efficient generation across multiple model sizes, with DSD-B (205M parameters) reaching an FID of 4.25 without classifier-free guidance. The approach offers a parameter-efficient path toward unified, foundation-model-like diffusion capable of scalable unsupervised learning and generation, though the study is limited by compute resources and its unsupervised learning capability still needs further validation.
Abstract
Standard Latent Diffusion Models rely on a complex, three-part architecture consisting of a separate encoder, decoder, and diffusion network, which are trained in multiple stages. This modular design is computationally inefficient, leads to suboptimal performance, and prevents the unification of diffusion with the single-network architectures common in vision foundation models. Our goal is to unify these three components into a single, end-to-end trainable network. We first demonstrate that a naive joint training approach fails catastrophically due to ``latent collapse'', where the diffusion training objective interferes with the network's ability to learn a good latent representation. We identify the root causes of this instability by drawing a novel analogy between diffusion and self-distillation-based unsupervised learning methods. Based on this insight, we propose Diffusion as Self-Distillation (DSD), a new framework with key modifications to the training objective that stabilize the latent space. This approach enables, for the first time, the stable end-to-end training of a single network that simultaneously learns to encode, decode, and perform diffusion. DSD achieves outstanding performance on the ImageNet $256\times 256$ conditional generation task: FID of 13.44/6.38/4.25 with only 42M/118M/205M parameters, respectively, after 50 training epochs and without using classifier-free guidance.
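The loss transformation that recasts velocity prediction as denoising the latent can be illustrated numerically. The sketch below is not the authors' implementation. It assumes a linear interpolation path $z_t = (1-t)\,z_0 + t\,\epsilon$, which this text does not specify, and all function names are hypothetical. Under that assumption the velocity target is $v = \epsilon - z_0$, so any velocity estimate implies a clean-latent estimate $z_0 = z_t - t\,v$.

```python
import random

# ASSUMPTION: linear interpolation path z_t = (1 - t) * z0 + t * eps.
# The source text does not state which path DSD uses; names are hypothetical.

def interpolate(z0, eps, t):
    """Noisy latent on the assumed linear path."""
    return [(1.0 - t) * a + t * b for a, b in zip(z0, eps)]

def velocity(z0, eps):
    """Velocity target for that path: v = d z_t / dt = eps - z0."""
    return [b - a for a, b in zip(z0, eps)]

def denoised_from_velocity(z_t, v, t):
    """Clean latent implied by a velocity: z0 = z_t - t * v."""
    return [a - t * b for a, b in zip(z_t, v)]

rng = random.Random(0)
z0 = [rng.gauss(0.0, 1.0) for _ in range(8)]   # stand-in for a clean latent
eps = [rng.gauss(0.0, 1.0) for _ in range(8)]  # Gaussian noise sample
t = 0.3

z_t = interpolate(z0, eps, t)
v = velocity(z0, eps)
z0_hat = denoised_from_velocity(z_t, v, t)

# With the exact velocity, the implied clean latent matches z0 up to rounding.
max_err = max(abs(a - b) for a, b in zip(z0, z0_hat))
```

Because the mapping between a velocity and the clean latent it implies is linear in this setting, a squared-error loss on the velocity and a squared-error loss on the denoised latent differ only by a factor of $t^2$. This is one way a velocity objective can be read as a denoising objective on the latent.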
