Nested Diffusion Models Using Hierarchical Latent Priors

Xiao Zhang, Ruoxi Jiang, Rebecca Willett, Michael Maire

TL;DR

The paper introduces nested diffusion models that generate images by cascading diffusion processes across a semantic latent hierarchy anchored to a frozen visual encoder. By extracting patch-based hierarchical latents, applying progressive channel reduction with SVD, and injecting Gaussian noise, the method preserves global structure while enabling detailed refinement. Empirical results on ImageNet-1K and COCO demonstrate substantial quality gains, with deeper hierarchies yielding strong improvements in unconditional and conditional generation and only modest increases in computation. The work highlights the importance of semantic representations and structured latent neighborhoods for diffusion-based generation and shows that unconditional generation can outperform conditional baselines under a balanced classifier-free guidance (CFG) regime.

Abstract

We introduce nested diffusion models, an efficient and powerful hierarchical generative framework that substantially enhances the generation quality of diffusion models, particularly for images of complex scenes. Our approach employs a series of diffusion models to progressively generate latent variables at different semantic levels. Each model in this series is conditioned on the output of the preceding higher-level models, culminating in image generation. Hierarchical latent variables guide the generation process along predefined semantic pathways, allowing our approach to capture intricate structural details while significantly improving image quality. To construct these latent variables, we leverage a pre-trained visual encoder, which learns strong semantic visual representations, and modulate its capacity via dimensionality reduction and noise injection. Across multiple datasets, our system demonstrates significant enhancements in image quality for both unconditional and class/text conditional generation. Moreover, our unconditional generation system substantially outperforms the baseline conditional system. These advancements incur minimal computational overhead as the more abstract levels of our hierarchy work with lower-dimensional representations.
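
The cascade described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: `toy_sampler` is a hypothetical stand-in for a trained diffusion sampler, the latent sizes and noise scales are arbitrary, and conditioning each level on all noised higher-level latents is an assumption made for illustration.

```python
import numpy as np

def toy_sampler(dim):
    """Hypothetical stand-in for a trained diffusion sampler at one level."""
    def sample(conds, rng):
        z = rng.standard_normal(dim)
        # Toy conditioning (assumption): shift the sample by each conditioning latent's mean.
        for c in conds:
            z = z + c.mean()
        return z
    return sample

def nested_sample(samplers, sigmas, rng):
    """Sample top-down. `samplers` is ordered from the most abstract level to the image level;
    `sigmas` holds one noise scale per non-final level."""
    if len(sigmas) != len(samplers) - 1:
        raise ValueError("need one sigma for every level except the last")
    conds = []
    z = None
    for level, sample in enumerate(samplers):
        z = sample(conds, rng)
        if level < len(sigmas):
            # Noise injection limits how much information flows to lower levels.
            conds.append(z + sigmas[level] * rng.standard_normal(z.shape))
    return z

rng = np.random.default_rng(0)
samplers = [toy_sampler(d) for d in (64, 256, 1024)]  # arbitrary, increasing sizes
x = nested_sample(samplers, sigmas=[0.5, 0.5], rng=rng)
print(x.shape)  # (1024,)
```

The most abstract latent is sampled first and has the smallest dimensionality, which matches the abstract's point that the upper levels add little computational overhead.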

Paper Structure

This paper contains 18 sections, 14 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Image generation quality when scaling our nested diffusion models on the ImageNet-1K dataset. Building deeper hierarchies leads to only a slight increase in computational overhead (particularly when $L \leq 4$), as measured by GFlops, while significantly improving generation quality. Compared to the single-level baseline model using comparable GFlops, our 5-level unconditional system significantly improves performance without classifier-free guidance (CFG), reducing FID from 45.19 to 11.05 and surpassing the class-conditional baseline of 19.74.
  • Figure 2: Nested diffusion architecture. Left: We train a sequence of diffusion models to generate a hierarchical collection of latent representations $\{{\mathbf{z}}_3, {\mathbf{z}}_2, {\mathbf{z}}_1 = {\mathbf{x}}\}$ of increasing dimensionality up to an image ${\mathbf{z}}_1 = {\mathbf{x}}$. Generated latents serve as conditional inputs (dotted lines) to diffusion models at subsequent levels, with separately parameterized noising processes, $\hat{{\mathbf{z}}}_{l}\sim \mathcal{N}({\mathbf{z}}_l, \sigma_{l}^2\mathbf{I})$, controlling the information capacity of these signals. Right: A pre-trained, frozen visual encoder provides target latent representations for each level of the hierarchy. To construct these latent features, we run the encoder on patchified images, reducing patch size and applying dimensionality reduction across feature channels in order to shift focus from local details to global semantics. Upper level targets encode more abstract semantics and, being lower-dimensional vectors, are less computationally expensive to synthesize, making hierarchical generation fast.
  • Figure 3: Feature compression via Gaussian noise. For a two-level hierarchical generator ($L=2$), we generate images conditioned on an oracle CLIP feature ${\mathbf{z}}_2$, inferred from input images, with feature channels reduced from 512 to 256 dimensions via SVD. Without noise ($\sigma_2 = 0$) added to ${\mathbf{z}}_2$, the generator ${\bm{D}}_{\theta_1}$ degenerates to an autoencoder that nearly reconstructs the input; adding Gaussian noise ($\sigma_2 = 0.5$) to ${\mathbf{z}}_2$ limits feature information, allowing for generation of new content.
  • Figure 4: Visualization of K-Nearest Neighbors (KNN) with different sources of latent features. For each input image, we display neighboring images, based on features extracted from two types of visual representations: CLIP representations, and VAE bottlenecks. Unlike the VAE, which focuses on low-level visual structures, CLIP emphasizes semantic representations, yielding more meaningful nearest neighbors. Our experiments demonstrate that running a diffusion model on a latent space with well-structured neighbors is essential for enhancing generation quality.
  • Figure 5: Visualization of unconditional image generation on ImageNet-1K. We present visualizations of images generated by hierarchical diffusion models containing from $2$ to $5$ levels, demonstrating that image quality improves as the depth of the hierarchy increases.
  • ...and 3 more figures
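
The feature compression described in Figures 2 and 3 (SVD channel reduction from 512 to 256 dimensions, followed by Gaussian noise with $\sigma_2 = 0.5$) can be sketched as follows. This is a minimal illustration rather than the paper's code: the random matrix stands in for real CLIP features, and centering the features before the SVD and the helper names `fit_svd_projection`, `reduce_channels`, and `inject_noise` are assumptions.

```python
import numpy as np

def fit_svd_projection(features, k):
    """Return (mean, basis); basis columns are the top-k right singular vectors."""
    mean = features.mean(axis=0, keepdims=True)  # centering is an assumption
    # Economy SVD of the centered features; rows of vt are orthonormal channel directions.
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:k].T  # basis shape: (channels, k)

def reduce_channels(features, mean, basis):
    """Project features onto the k retained channel directions."""
    return (features - mean) @ basis

def inject_noise(z, sigma, rng):
    """Draw z_hat ~ N(z, sigma^2 I)."""
    return z + sigma * rng.standard_normal(z.shape)

rng = np.random.default_rng(0)
feats = rng.standard_normal((1000, 512))  # stand-in for 512-channel encoder features
mean, basis = fit_svd_projection(feats, k=256)
z2 = reduce_channels(feats, mean, basis)       # 512 -> 256 channels
z2_hat = inject_noise(z2, sigma=0.5, rng=rng)  # noised conditioning signal
print(z2.shape, z2_hat.shape)
```

With `sigma=0` the conditioning latent is passed through unchanged, which corresponds to the autoencoder-like regime described in Figure 3; a larger `sigma` discards more of the feature information.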