PaGoDA: Progressive Growing of a One-Step Generator from a Low-Resolution Diffusion Teacher

Dongjun Kim; Chieh-Hsin Lai; Wei-Hsiang Liao; Yuhta Takida; Naoki Murata; Toshimitsu Uesaka; Yuki Mitsufuji; Stefano Ermon

TL;DR

PaGoDA tackles the prohibitive training cost of high-resolution diffusion models with a three-stage approach: it first trains a diffusion model on downsampled data, then distills that model into a one-step generator via DDIM inversion, and finally grows a decoder that upsamples to high resolutions. The authors provide theoretical guarantees for optimality and training stability under an objective that combines a reconstruction loss with an adversarial loss, and extend the method with classifier-free guidance (CFG) for text-conditioned generation. Empirically, PaGoDA achieves state-of-the-art FID on ImageNet across resolutions from $64\times64$ to $512\times512$ without CFG, and demonstrates competitive text-to-image results with CFG, while enabling efficient training on modest hardware. The pipeline promises broader access to high-quality diffusion training and scalable, controllable image generation, with potential integration into latent-diffusion-model pipelines and downstream inversion tasks.
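To make the resolution range concrete, the sketch below lists the resolutions a progressively grown decoder would pass through between $64\times64$ and $512\times512$. It is a minimal illustration, not the authors' code: the function name `growth_schedule` and the assumption that each growth step doubles the side length are hypothetical and are not stated in this summary.

```python
def growth_schedule(base_res: int, target_res: int, factor: int = 2) -> list[int]:
    """Return the side lengths visited when growing from base_res to target_res.

    Assumption (not from the source): every growth step multiplies the
    side length by `factor`, with factor=2 as the default.
    """
    if base_res <= 0 or factor < 2:
        raise ValueError("base_res must be positive and factor at least 2")
    schedule = [base_res]
    # Keep multiplying until we reach or pass the target resolution.
    while schedule[-1] < target_res:
        schedule.append(schedule[-1] * factor)
    if schedule[-1] != target_res:
        raise ValueError("target_res must equal base_res times a power of factor")
    return schedule


schedule = growth_schedule(64, 512)
print(schedule)  # [64, 128, 256, 512]
```

Under the doubling assumption, going from a $64\times64$ base to $512\times512$ takes three growth steps.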

Abstract

Diffusion models perform remarkably well at generating high-dimensional content but are computationally intensive, especially during training. We propose Progressive Growing of Diffusion Autoencoder (PaGoDA), a novel pipeline that reduces training costs through three stages: training a diffusion model on downsampled data, distilling the pretrained diffusion model, and progressive super-resolution. With the proposed pipeline, PaGoDA achieves a $64\times$ reduction in the cost of training its diffusion model on $8\times$ downsampled data, while at inference, with a single step, it achieves state-of-the-art performance on ImageNet across all resolutions from $64\times64$ to $512\times512$, and in text-to-image generation. PaGoDA's pipeline can be applied directly in the latent space, adding compression alongside the pre-trained autoencoder in Latent Diffusion Models (e.g., Stable Diffusion). The code is available at https://github.com/sony/pagoda.
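The $64\times$ figure lines up with simple pixel counting: downsampling each side by 8 leaves $1/8^2 = 1/64$ of the pixels. The sketch below checks that arithmetic on a toy image. It assumes average pooling as the downsampling operator, which this summary does not specify, and the helper names `downsample_mean` and `pixel_ratio` are hypothetical. Actual training cost also depends on the architecture, so the pixel ratio is only a rough proxy.

```python
def downsample_mean(image, factor):
    """Average-pool a 2D list of floats over non-overlapping factor x factor blocks.

    Assumption (not from the source): average pooling stands in for
    whatever downsampling operator the paper actually uses.
    """
    height, width = len(image), len(image[0])
    if height % factor or width % factor:
        raise ValueError("image sides must be divisible by factor")
    pooled = []
    for i in range(0, height, factor):
        row = []
        for j in range(0, width, factor):
            block = [image[i + di][j + dj]
                     for di in range(factor) for dj in range(factor)]
            row.append(sum(block) / len(block))
        pooled.append(row)
    return pooled


def pixel_ratio(factor):
    """Number of original pixels represented by one downsampled pixel."""
    return factor * factor


# Toy 16x16 "image" whose pixel value encodes its position.
image = [[float(r * 16 + c) for c in range(16)] for r in range(16)]
small = downsample_mean(image, 8)

print(len(small), len(small[0]))  # 2 2
print(pixel_ratio(8))             # 64
```

For a $512\times512$ image, the same factor of 8 gives a $64\times64$ base resolution, the lower end of the range reported above.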

Paper Structure (41 sections, 12 theorems, 93 equations, 17 figures, 8 tables)

Introduction
Preliminary
Progressive Growing of Diffusion Autoencoder
Stage 1: Diffusion Models Trained on Downsampled Data
Stage 2: Diffusion Distillation on Downsampled Data with DDIM Inversion
Stage 3: Progressively Growing Decoder for Super-Resolution
Optimality Guarantee and Training Stability of PaGoDA Pipeline
PaGoDA with Classifier-Free Guidance
Classifier-Free Guided Adversarial Loss
PaGoDA Pipeline with Classifier-Free Guidance
Experiments
PaGoDA Tested on ImageNet without CFG
Quantitative Results
Discussion on Base Resolution
Discussion on Upscaling Capability
...and 26 more sections

Key Result

Theorem 3.1

Let $\lambda>0$. Suppose $D^{*}(G)\in\mathop{\mathrm{arg\,max}}\limits_{D}\mathcal{L}_{\text{adv}}(G,D)$. If both PaGoDA's reconstruction loss and adversarial loss share a common minimizer $G^*$, then $p_{G^*}(\mathbf{x})=p_{\text{data}}(\mathbf{x})$. Here, $p_{G^*}$ is the generative distribution l

Figures (17)

Figure 1: Pipeline overview. PaGoDA deterministically encodes with downsampling followed by DDIM inversion, and constructs its decoder in a progressively growing manner.
Figure 2: (Top) At Stage 2, PaGoDA learns the one-step generator at a base resolution. (Bottom) At Stage 3, PaGoDA progressively learns super-resolution by adding network blocks.
Figure 3: Effect of the reconstruction loss in Stage 3. Without the reconstruction loss, the object moves at each resolution jump.
Figure 4: The adversarial loss makes PaGoDA competitive with GAN-based super-resolution models in Stage 3.
Figure 5: Comparison of $\mathcal{L}_{\text{dstl}}$ and $\mathcal{L}_{\text{rec}}$, both combined with $\mathcal{L}_{\text{adv}}$, using identical hyperparameters. $\mathcal{L}_{\text{rec}}$ shows more robust performance, as also supported by Theorem 3.1.
...and 12 more figures

Theorems & Definitions (14)

Theorem 3.1
Theorem 3.2
Theorem B.1
Lemma B.2: Proposition 3.5 in Saumard and Wellner (2014)
Theorem B.3: Variant of Theorem \ref{th:w_2_convergence}
Theorem B.4
Lemma B.5
proof
Definition B.1
Lemma B.6
...and 4 more
