One-Step Diffusion Distillation via Deep Equilibrium Models

Zhengyang Geng, Ashwini Pokle, J. Zico Kolter

TL;DR

This paper tackles the slow sampling of diffusion models by introducing the Generative Equilibrium Transformer (GET), a single-step generative model based on Deep Equilibrium (DEQ) models and trained offline from noise/image pairs produced by a pretrained diffusion model. GET uses a two-component DEQ architecture, an Injection transformer (InjectionT) and an Equilibrium transformer (EquilibriumT), to map Gaussian noise directly to images, with optional class conditioning and without trajectory information or time embeddings. Empirically, GET achieves strong image quality with substantially higher parameter and data efficiency than online distillation methods, matching or surpassing a 5× larger ViT at lower compute and memory cost, and shows favorable scaling behavior for implicit models on CIFAR-10. These findings highlight the practical relevance of implicit, weight-tied architectures for fast, high-quality generative modeling in resource-constrained settings.

Abstract

Diffusion models excel at producing high-quality samples but naively require hundreds of iterations, prompting multiple attempts to distill the generation process into a faster network. However, many existing approaches suffer from a variety of challenges: the process for distillation training can be complex, often requiring multiple training stages, and the resulting models perform poorly when utilized in single-step generative applications. In this paper, we introduce a simple yet effective means of distilling diffusion models directly from initial noise to the resulting image. Of particular importance to our approach is to leverage a new Deep Equilibrium (DEQ) model as the distilled architecture: the Generative Equilibrium Transformer (GET). Our method enables fully offline training with just noise/image pairs from the diffusion model while achieving superior performance compared to existing one-step methods on comparable training budgets. We demonstrate that the DEQ architecture is crucial to this capability, as GET matches a $5\times$ larger ViT in terms of FID scores while striking a critical balance of computational cost and image quality. Code, checkpoints, and datasets are available.
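
To make the offline setup concrete, here is a minimal toy sketch of distilling from cached noise/image pairs. Everything in it is an assumption for illustration: `teacher_sample` is a hypothetical scalar stand-in for a multi-step diffusion sampler, the student is a linear map rather than GET, and the squared-error loss is a placeholder, since this section does not state the training loss. The point it illustrates is only that the teacher is run once to build the dataset and is never queried during student training.

```python
import random


def teacher_sample(noise, steps=100):
    """Hypothetical stand-in for a slow, multi-step diffusion sampler."""
    target = 2.0 * noise + 1.0  # toy "image" the teacher converges to
    x = noise
    for _ in range(steps):
        x += 0.1 * (target - x)  # one refinement step toward the target
    return x


def build_pairs(num_pairs, seed=0):
    """Run the teacher once, offline, and cache (noise, image) pairs."""
    rng = random.Random(seed)
    noises = [rng.gauss(0.0, 1.0) for _ in range(num_pairs)]
    return [(z, teacher_sample(z)) for z in noises]


def train_student(pairs, epochs=50, lr=0.05):
    """Fit a one-step student x = w * z + b using only the cached pairs."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for z, x in pairs:
            err = (w * z + b) - x  # gradient of 0.5 * err**2 (assumed loss)
            w -= lr * err * z
            b -= lr * err
    return w, b


pairs = build_pairs(200)      # the only place the teacher is called
w, b = train_student(pairs)   # training never touches the teacher
print(f"student: x = {w:.3f} * z + {b:.3f}")
```

After training, the student maps a noise value to an output in a single evaluation, whereas the toy teacher needs 100 refinement steps for the same input.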

Paper Structure (37 sections, 11 equations, 4 figures, 8 tables)

Figures (4)

  • Figure 1: Generative Equilibrium Transformer (GET). (Left) GET consists of two major components: the Injection transformer and the Equilibrium transformer. The Injection transformer transforms noise embeddings into an input injection for the Equilibrium transformer. The Equilibrium transformer is the equilibrium layer: it takes in the noise input injection and an optional class embedding and solves for the fixed point. (Right) Details of the transformer blocks in the Injection transformer (Inj) and Equilibrium transformer (DEQ), respectively. Blue dotted boxes denote optional class label inputs.
  • Figure 2: Data and Parameter Efficiency of GET: (a) (Left) GET outperforms PD and a 5× larger ViT in fewer iterations, yielding better FID scores. Additionally, longer training times lead to improved FID scores. (b) (Right) Smaller GETs can achieve better FID scores than larger ViTs, demonstrating DEQ's parameter efficiency. Each curve in this plot connects models of different sizes within the same model family at identical training iterations, as indicated by the numbers after the model names in the legend.
  • Figure 3: (a) (Left) Sampling speed of GET: GET can sample faster than large ViTs, while achieving better FID scores. The size of each individual circle is proportional to the model size. For GETs, we vary the number of iterations in the Equilibrium transformer (2 to 6 iterations). The trends indicate that GETs can improve their FID scores by using more compute. (b) (Right) Compute efficiency of GET: Larger GET models use training compute more efficiently. For a given GET, the training budget is calculated from training iterations. Refer to \ref{table:unconditional-cifar-get-arch} for the exact size of GET models.
  • Figure 4: Uncurated CIFAR-10 image samples generated by (a) (Left) unconditional GET and (b) (Right) class-conditional GET. Each row corresponds to a class in CIFAR-10.
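
The two-component structure described for Figure 1, and the iteration counts varied in Figure 3, can be illustrated with a small toy sketch. It is not the GET architecture: `injection`, `equilibrium_layer`, and `solve_fixed_point` are hypothetical names, the layers are simple scalar maps rather than transformers, and plain fixed-point iteration is assumed as the solver. It shows the general DEQ pattern only: one weight-tied layer is applied repeatedly to the same input injection, and more iterations bring the state closer to the fixed point.

```python
import math


def injection(noise):
    """Hypothetical stand-in for the Injection transformer: noise -> injection."""
    return [0.5 * n for n in noise]


def equilibrium_layer(z, u, weight=0.5):
    """Hypothetical weight-tied layer f(z, u); |weight| < 1 keeps it contractive."""
    mean_z = sum(z) / len(z)  # crude token mixing in place of attention
    return [math.tanh(weight * (0.5 * zi + 0.5 * mean_z) + ui)
            for zi, ui in zip(z, u)]


def solve_fixed_point(u, num_iters):
    """Apply the same layer num_iters times (assumed solver: plain iteration)."""
    z = [0.0] * len(u)
    for _ in range(num_iters):
        z = equilibrium_layer(z, u)
    return z


def residual(z, u):
    """Largest per-coordinate gap between f(z, u) and z."""
    return max(abs(fi - zi) for fi, zi in zip(equilibrium_layer(z, u), z))


noise = [0.3, -1.2, 0.8, 0.1]
u = injection(noise)
for k in (2, 4, 6):
    print(k, residual(solve_fixed_point(u, k), u))
```

In this toy, the residual shrinks as the iteration count grows, which mirrors the compute-for-quality trade-off the Figure 3 caption describes, although FID itself is not modeled here.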