Greedy Growing Enables High-Resolution Pixel-Based Diffusion Models

Cristina N. Vasconcelos, Abdullah Rashwan, Austin Waters, Trevor Walker, Keyang Xu, Jimmy Yan, Rui Qian, Shixin Luo, Zarana Parekh, Andrew Bunner, Hongliang Fei, Roopal Garg, Mandy Guo, Ivana Kajic, Yeqing Li, Henna Nandwani, Jordi Pont-Tuset, Yasumasa Onoe, Sarah Rosston, Su Wang, Wenlei Zhou, Kevin Swersky, David J. Fleet, Jason M. Baldridge, Oliver Wang

TL;DR

This work tackles the challenge of training high-resolution, large-scale pixel-space diffusion models without cascades by decoupling text-to-image alignment from high-resolution rendering. It introduces Shallow-UViT, a lightweight core whose representations are pretrained on large text-image datasets, followed by a two-phase greedy growing procedure that adds high-resolution encoder/decoder layers while freezing the core to preserve learned representations. The approach scales up to 8B parameters and enables end-to-end single-stage generation at $1024\times1024$, as demonstrated by Vermeer, which achieves strong human preference over SDXL while maintaining competitive automated metrics. The method delivers stability with small batch sizes ($256$) and limited regularization, suggesting a practical path to robust, high-quality, non-cascaded pixel-based diffusion at scale.

Abstract

We address the long-standing problem of how to learn effective pixel-based image diffusion models at scale, introducing a remarkably simple greedy growing method for stable training of large-scale, high-resolution models, without the need for cascaded super-resolution components. The key insight stems from careful pre-training of core components, namely those responsible for text-to-image alignment vs. high-resolution rendering. We first demonstrate the benefits of scaling a Shallow UNet, with no down(up)-sampling enc(dec)oder. Scaling its deep core layers is shown to improve alignment, object structure, and composition. Building on this core model, we propose a greedy algorithm that grows the architecture into high-resolution end-to-end models, while preserving the integrity of the pre-trained representation, stabilizing training, and reducing the need for large high-resolution datasets. This enables a single-stage model capable of generating high-resolution images without the need for a super-resolution cascade. Our key results rely on public datasets and show that we are able to train non-cascaded models up to 8B parameters with no further regularization schemes. Vermeer, our full pipeline model trained with internal datasets to produce $1024\times1024$ images without cascades, is preferred over SDXL by human evaluators (44.0% vs. 21.4%).
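The growing step described above, in which new layers are trained around a frozen pretrained core, can be illustrated with a toy sketch. This is not the authors' implementation: the scalar "layers", the names `forward`, `loss_and_grads`, `sgd_step`, and `grow_and_train`, the fixed initial values, and the learning rate are all assumptions chosen so the example runs with the Python standard library alone. It only shows the mechanism of updating the newly added encoder/decoder parameters while leaving the core untouched.

```python
def forward(params, x):
    """Toy end-to-end model: new encoder -> pretrained core -> new decoder."""
    h = params["encoder"]["w"] * x      # newly added layer (hypothetical)
    h = params["core"]["w"] * h         # pretrained core, kept frozen below
    return params["decoder"]["w"] * h   # newly added layer (hypothetical)


def loss_and_grads(params, x, y):
    """Squared error and its analytic gradient for every parameter group."""
    e = params["encoder"]["w"]
    c = params["core"]["w"]
    d = params["decoder"]["w"]
    err = d * c * e * x - y
    grads = {
        "encoder": {"w": 2 * err * d * c * x},
        "core": {"w": 2 * err * d * e * x},
        "decoder": {"w": 2 * err * c * e * x},
    }
    return err * err, grads


def sgd_step(params, grads, lr, frozen):
    """Plain SGD update that skips every parameter group listed in `frozen`."""
    return {
        group: {
            name: value if group in frozen else value - lr * grads[group][name]
            for name, value in weights.items()
        }
        for group, weights in params.items()
    }


def grow_and_train(core, data, steps=500, lr=0.01):
    """Wrap a pretrained core with new layers and train only the new layers."""
    params = {
        "encoder": {"w": 1.0},  # assumed initial value for the new layer
        "core": dict(core),     # copy of the pretrained core weights
        "decoder": {"w": 1.0},  # assumed initial value for the new layer
    }
    history = []
    for step in range(steps):
        x, y = data[step % len(data)]
        loss, grads = loss_and_grads(params, x, y)
        history.append(loss)
        params = sgd_step(params, grads, lr, frozen={"core"})
    return params, history


# Usage: the "pretrained" core has weight 2.0; the target mapping is y = 3x.
core = {"w": 2.0}
data = [(0.5, 1.5), (1.0, 3.0)]
params, history = grow_and_train(core, data)
print(params["core"], history[0], history[-1])
```

Passing an empty `frozen` set to `sgd_step` would instead update the core together with the new layers, which corresponds to finetuning the full model end to end rather than preserving the pretrained representation.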

Paper Structure (26 sections, 12 figures, 12 tables)

Figures (12)

  • Figure 1: Shallow-UViT architecture: The input image grid is quickly reduced at the entry convolution, while a single residual block with no subsampling layers is used as a shallow encoder and decoder. The layers within the core components (in light green) are reused in the final end-to-end architecture, increasing its training stability, while the remaining layers are discarded.
  • Figure 2: Qualitative comparison of models with core components of increasing size -- Shallow-UViTs trained at $64\times 64$ pixels using CC12M dataset only. Prompts: A sloth running a marathon, surprisingly outrunning all competitors. A hand spread out on a wall. DSLR photograph. Close-up portrait of a ballerina in mid-performance, with high motion and dramatic lighting. Word art of "happy birthday", with a smiling panda wearing a party hat, surrounded by gift boxes and a birthday cake. Four dogs on the street.
  • Figure 3: Overfitting and memorization of Shallow-UViT XHuge trained on CC12M. Prompts: (top) A group of construction workers in the style of 'The Night Watch' by Rembrandt; (middle) A dynamic rendition of a racing cyclist leading their team through a mountain pass, rendered in the style of 'Napoleon Crossing the Alps' by Jacques-Louis David; (bottom) A group of friends enjoying a summer day at a riverside restaurant in the style of 'A Sunday Afternoon on the Island of La Grande Jatte' by Georges Seurat.
  • Figure 4: Measuring the impact of scaling on the counting task. Using 59 systematic prompts describing 1-5 objects. Five human annotators reviewed each image (95% bootstrapped confidence intervals are shown). Models with larger core components are observed to perform better on counting. Sample prompt: 3 apples.
  • Figure 5: On catastrophic forgetting during early steps of finetuning: the pretrained representation quickly deteriorates due to noise introduced by the random weights of newly added layers. (From left to right) $64\times64$ image produced by the pretrained Shallow-Unet-Huge; followed by $512\times512$ images produced at early steps of finetuning (ft.) the core representation in an end-to-end (E2E) model (in green), and with the core layers frozen (in blue). Differences are better observed by zooming in. Prompts: A close-up portrait of a butterfly, revealing the intricate patterns and textures on its wings in exquisite detail. A loving mother kangaroo carrying her joey in her pouch. A determined sea turtle swimming against the ocean current. A graceful hummingbird hovering near a bright pink flower. A dark and gothic illustration of a raven perched on a skull. A colorful macaw soaring through a lush, vibrant rainforest. A playful wolf pup chasing its own tail.
  • ...and 7 more figures