Nested Diffusion Models Using Hierarchical Latent Priors
Xiao Zhang, Ruoxi Jiang, Rebecca Willett, Michael Maire
TL;DR
The paper introduces nested diffusion models, which generate images by cascading diffusion processes across a hierarchy of semantic latents anchored to a frozen visual encoder. The method extracts patch-based hierarchical latents, progressively reduces their channel dimension with SVD, and injects Gaussian noise, which preserves global structure while leaving room for detailed refinement. Empirical results on ImageNet-1K and COCO show substantial quality gains: deeper hierarchies yield strong improvements in both unconditional and conditional generation at only a modest increase in computation. The work highlights the importance of semantic representations and structured latent neighborhoods for diffusion-based generation, and shows that unconditional generation can outperform conditional baselines under a balanced classifier-free guidance (CFG) regime.
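The latent construction summarized above (channel reduction with SVD followed by Gaussian noise injection) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the random stand-in for frozen-encoder patch features, the per-level channel counts `(4, 16, 32)`, the noise scale `sigma=0.1`, and the choice to fit the SVD on a single feature matrix are all assumptions made for illustration.

```python
import numpy as np


def reduce_channels_svd(feats, k):
    """Project (N, C) patch features onto their top-k right singular vectors."""
    mean = feats.mean(axis=0, keepdims=True)
    centered = feats - mean
    # Thin SVD: centered = U @ diag(S) @ Vt, with Vt of shape (min(N, C), C).
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:k].T  # (C, k), orthonormal columns
    return centered @ basis, basis


def inject_noise(latents, sigma, rng):
    """Add isotropic Gaussian noise with standard deviation sigma."""
    return latents + sigma * rng.standard_normal(latents.shape)


def build_hierarchy(feats, dims, sigma, rng):
    """Return one noisy, reduced latent per level, most abstract level first."""
    levels = []
    for k in dims:
        z, _ = reduce_channels_svd(feats, k)
        levels.append(inject_noise(z, sigma, rng))
    return levels


rng = np.random.default_rng(0)
# Hypothetical stand-in for encoder output: 14x14 patches with 64 channels each.
feats = rng.standard_normal((196, 64))
hierarchy = build_hierarchy(feats, dims=(4, 16, 32), sigma=0.1, rng=rng)
print([z.shape for z in hierarchy])  # [(196, 4), (196, 16), (196, 32)]
```

In this sketch, fewer retained channels and added noise both limit how much information a level can carry, which is one way to read the statement that more abstract levels work with lower-dimensional representations.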
Abstract
We introduce nested diffusion models, an efficient and powerful hierarchical generative framework that substantially enhances the generation quality of diffusion models, particularly for images of complex scenes. Our approach employs a series of diffusion models to progressively generate latent variables at different semantic levels. Each model in this series is conditioned on the output of the preceding higher-level models, culminating in image generation. Hierarchical latent variables guide the generation process along predefined semantic pathways, allowing our approach to capture intricate structural details while significantly improving image quality. To construct these latent variables, we leverage a pre-trained visual encoder, which learns strong semantic visual representations, and modulate its capacity via dimensionality reduction and noise injection. Across multiple datasets, our system demonstrates significant enhancements in image quality for both unconditional and class/text conditional generation. Moreover, our unconditional generation system substantially outperforms the baseline conditional system. These advancements incur minimal computational overhead as the more abstract levels of our hierarchy work with lower-dimensional representations.
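One way to read the cascade described in the abstract is as an autoregressive factorization over latent levels. The notation below is an assumption introduced for illustration and is not taken from the source: $z_1$ denotes the most abstract latent, $z_L$ the most detailed, $x$ the image, and $c$ an optional class or text condition (dropped in the unconditional case).

```latex
p_\theta(x, z_{1:L} \mid c)
  = p_\theta(x \mid z_{1:L}, c) \,
    \prod_{l=1}^{L} p_\theta(z_l \mid z_{<l}, c)
```

Under this reading, each factor would be modeled by its own diffusion model, and sampling proceeds ancestrally: draw $z_1$, then each $z_l$ given the levels above it, and finally the image $x$.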
