Diffusion As Self-Distillation: End-to-End Latent Diffusion In One Model

Xiyuan Wang, Muhan Zhang

TL;DR

This work tackles the inefficiency of three-part latent diffusion pipelines by unifying the encoder, decoder, and diffusion model into a single end-to-end network. It identifies latent collapse as the key barrier to naive joint training and reframes the problem through self-distillation theory, introducing Diffusion as Self-Distillation (DSD). The core innovations are decoupling the target latent via stop-gradient and a loss transformation that recasts velocity prediction as denoising the latent, which together enable stable end-to-end training on ViT backbones. Empirical results on ImageNet 256×256 show strong, parameter-efficient generation across three model sizes, with DSD-B (205M parameters) reaching an FID of 4.25 without classifier-free guidance. The approach offers a parameter-efficient path toward unified, foundation-model-like diffusion capable of scalable unsupervised learning and generation, although the experiments are constrained by available compute and the unsupervised-learning claims need further validation.

Abstract

Standard Latent Diffusion Models rely on a complex, three-part architecture consisting of a separate encoder, decoder, and diffusion network, which are trained in multiple stages. This modular design is computationally inefficient, leads to suboptimal performance, and prevents the unification of diffusion with the single-network architectures common in vision foundation models. Our goal is to unify these three components into a single, end-to-end trainable network. We first demonstrate that a naive joint training approach fails catastrophically due to "latent collapse", where the diffusion training objective interferes with the network's ability to learn a good latent representation. We identify the root causes of this instability by drawing a novel analogy between diffusion and self-distillation-based unsupervised learning methods. Based on this insight, we propose Diffusion as Self-Distillation (DSD), a new framework with key modifications to the training objective that stabilize the latent space. This approach enables, for the first time, the stable end-to-end training of a single network that simultaneously learns to encode, decode, and perform diffusion. DSD achieves outstanding performance on the ImageNet $256\times 256$ conditional generation task: FID=13.44/6.38/4.25 with only 42M/118M/205M parameters and 50 training epochs on ImageNet, without using classifier-free guidance.

Paper Structure

This paper contains 34 sections, 15 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Analogy between Self-Distillation (SD) based unsupervised learning and the standard diffusion model loss, comparing four architectural paradigms. (1) SD: The Online Encoder ($E_1$) generates a latent $\mathbf{z}_1$ from an augmented image ($\mathbf{x}^{+}$, e.g., with a random mask). $\mathbf{z}_1$ is then passed through a predictor ($P$) and trained to match the target latent $\mathbf{z}_2$ provided by the Target Encoder ($E_2$), a copy that receives no gradients and is updated via an Exponential Moving Average (EMA). SD is a stable method for producing high-quality latent representations. (2) Latent Diffusion: The encoder ($E$) is frozen, and only the diffusion model ($v_\theta$) is trained. The clean image latent $\mathbf{z}$ is used in two ways: to form the input $\mathbf{z}_t$ of the diffusion model and to form the target $\mathbf{z}-\mathbf{\epsilon}$ in the $L_2$ loss. (3) Vanilla Joint Training: Unlike LDM, the encoder ($E$) is unfrozen and optimized with the diffusion loss. This setup leads directly to latent collapse and poor image quality. The colored parts of the figure mark the structural causes of this collapse. (4) Our Diffusion as Self-Distillation (DSD): By fixing the identified collapse mechanisms, the DSD framework achieves a unified architecture in which the encoder, predictor, and diffusion backbone are all integrated into a single Vision Transformer (ViT) with task-specific heads.
  • Figure 2: Trajectories of the effective rank of the latent representations and of the reconstruction loss during end-to-end training of a ViT-S model on ImageNet. A stable training process is characterized by a decreasing reconstruction loss (orange lines) and a high effective rank (blue lines).
  • Figure 3: Geometric interpretation of the loss transformation. The velocity $\mathbf{v}=\mathbf{z}-\mathbf{\epsilon}$ is proportional to $\mathbf{z}-\mathbf{z}_t$. Therefore, training a model to predict the clean end point $\mathbf{z}$ (denoising) is equivalent to training a model to predict the velocity.
  • Figure 4: Our Unified DSD Architecture. The different ViT blocks in the figure are in fact one single ViT backbone with different heads. The green blocks are the three objectives in DSD. The yellow blocks are modules updated by gradient descent. The blue blocks are updated by EMA. The black lines denote the ordinary forward pass. The grey dotted lines denote forward passes with stop-gradient.
  • Figure 5: Qualitative results on ImageNet $256 \times 256$ using DSD-B. We use a classifier-free guidance scale of $7.3$.
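
The proportionality stated in the Figure 3 caption can be checked numerically. The sketch below is illustrative only and is not taken from the paper: it assumes a linear interpolation path $\mathbf{z}_t = t\,\mathbf{z} + (1-t)\,\mathbf{\epsilon}$ (the captions do not state the path), under which $\mathbf{z}-\mathbf{z}_t = (1-t)(\mathbf{z}-\mathbf{\epsilon}) = (1-t)\,\mathbf{v}$. The helper names `interpolate` and `velocity_from_clean` are hypothetical.

```python
import random


def interpolate(z, eps, t):
    """Noisy latent on an ASSUMED linear path: z_t = t * z + (1 - t) * eps.

    With this convention t = 1 gives the clean latent and t = 0 pure noise.
    """
    return [t * zi + (1.0 - t) * ei for zi, ei in zip(z, eps)]


def velocity_from_clean(z_hat, z_t, t):
    """Convert a prediction of the clean latent into a velocity.

    Under the assumed path, z - z_t = (1 - t) * (z - eps), so
    v = (z_hat - z_t) / (1 - t). The conversion is undefined at t = 1.
    """
    if not 0.0 <= t < 1.0:
        raise ValueError("t must lie in [0, 1)")
    return [(zh - zt) / (1.0 - t) for zh, zt in zip(z_hat, z_t)]


random.seed(0)
dim = 8
z = [random.gauss(0.0, 1.0) for _ in range(dim)]    # stand-in clean latent
eps = [random.gauss(0.0, 1.0) for _ in range(dim)]  # Gaussian noise sample
t = 0.3

z_t = interpolate(z, eps, t)
v_true = [zi - ei for zi, ei in zip(z, eps)]        # velocity target v = z - eps
v_rec = velocity_from_clean(z, z_t, t)              # velocity from a perfect denoiser

max_err = max(abs(a - b) for a, b in zip(v_true, v_rec))
print(f"max |v - v_rec| = {max_err:.2e}")
```

A perfect denoiser therefore yields the exact velocity after division by $1-t$. Under this assumed path the two targets differ only by a timestep-dependent scale, which is the sense in which the caption calls the two training objectives equivalent.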