Table of Contents
Fetching ...

UltraPixel: Advancing Ultra-High-Resolution Image Synthesis to New Peaks

Jingjing Ren, Wenbo Li, Haoyu Chen, Renjing Pei, Bin Shao, Yong Guo, Long Peng, Fenglong Song, Lei Zhu

TL;DR

UltraPixel addresses the challenge of ultra-high-resolution image generation by integrating a cascade diffusion framework with low-resolution semantic guidance. It learns implicit representations to continuously upsample guidance and uses scale-aware normalization to support multiple resolutions within a shared, compact latent space, achieving 1K–6K outputs efficiently. The approach yields state-of-the-art or competitive perceptual metrics and faster inference compared to several baselines, while requiring modest training data. The work also demonstrates potential for controllable generation and personalization, albeit with attention to dataset quality and responsible use.

Abstract

Ultra-high-resolution image generation poses great challenges, such as increased semantic planning complexity and detail synthesis difficulties, alongside substantial training resource demands. We present UltraPixel, a novel architecture utilizing cascade diffusion models to generate high-quality images at multiple resolutions (\textit{e.g.}, 1K to 6K) within a single model, while maintaining computational efficiency. UltraPixel leverages semantics-rich representations of lower-resolution images in the later denoising stage to guide the whole generation of highly detailed high-resolution images, significantly reducing complexity. Furthermore, we introduce implicit neural representations for continuous upsampling and scale-aware normalization layers adaptable to various resolutions. Notably, both low- and high-resolution processes are performed in the most compact space, sharing the majority of parameters with less than 3$\%$ additional parameters for high-resolution outputs, largely enhancing training and inference efficiency. Our model achieves fast training with reduced data requirements, producing photo-realistic high-resolution images and demonstrating state-of-the-art performance in extensive experiments.

UltraPixel: Advancing Ultra-High-Resolution Image Synthesis to New Peaks

TL;DR

UltraPixel addresses the challenge of ultra-high-resolution image generation by integrating a cascade diffusion framework with low-resolution semantic guidance. It learns implicit representations to continuously upsample guidance and uses scale-aware normalization to support multiple resolutions within a shared, compact latent space, achieving 1K–6K outputs efficiently. The approach yields state-of-the-art or competitive perceptual metrics and faster inference compared to several baselines, while requiring modest training data. The work also demonstrates potential for controllable generation and personalization, albeit with attention to dataset quality and responsible use.

Abstract

Ultra-high-resolution image generation poses great challenges, such as increased semantic planning complexity and detail synthesis difficulties, alongside substantial training resource demands. We present UltraPixel, a novel architecture utilizing cascade diffusion models to generate high-quality images at multiple resolutions (\textit{e.g.}, 1K to 6K) within a single model, while maintaining computational efficiency. UltraPixel leverages semantics-rich representations of lower-resolution images in the later denoising stage to guide the whole generation of highly detailed high-resolution images, significantly reducing complexity. Furthermore, we introduce implicit neural representations for continuous upsampling and scale-aware normalization layers adaptable to various resolutions. Notably, both low- and high-resolution processes are performed in the most compact space, sharing the majority of parameters with less than 3 additional parameters for high-resolution outputs, largely enhancing training and inference efficiency. Our model achieves fast training with reduced data requirements, producing photo-realistic high-resolution images and demonstrating state-of-the-art performance in extensive experiments.
Paper Structure (16 sections, 5 equations, 30 figures, 7 tables)

This paper contains 16 sections, 5 equations, 30 figures, 7 tables.

Figures (30)

  • Figure 1: The proposed UltraPixel creates highly photo-realistic and detail-rich images at various resolutions. Best viewed zoomed in. All image prompts in this paper are listed in the appendix.
  • Figure 2: Illustration of feature distribution disparity across varying resolutions.
  • Figure 3: Method Overview. Initially, we extract guidance from the low-resolution (LR) image synthesis process and upscale it by learning an implicit neural representation. This upscaled guidance is then integrated into the high-resolution (HR) generation branch. The generated HR latent undergoes a cascade decoding process, ultimately producing a high-resolution image.
  • Figure 4: Illustration of continuous upscaling by implicit neural representation.
  • Figure 5: Architecture details of generative diffusion model.
  • ...and 25 more figures