Creatively Upscaling Images with Global-Regional Priors

Yurui Qian, Qi Cai, Yingwei Pan, Ting Yao, Tao Mei

TL;DR

C-Upscale is a tuning-free, diffusion-based upscaling framework that uses three global-regional priors to produce high-fidelity, creative ultra-high-resolution images. The Global Structure Prior enforces global semantic alignment through the low-frequency content of the low-resolution latent. The Regional Attention Prior and the Regional Semantic Prior handle region-wise consistency and detail generation, using cropped cross-attention and prompts produced by a Multimodal LLM, respectively. An Attention Composer fuses the two regional priors into coherent, detail-rich outputs at 4,096 × 4,096 and higher resolutions. Experiments on synthetic and real-world inputs, ablations, and human studies report better performance than state-of-the-art tuning-free methods and SR baselines, with favorable efficiency and generalization across diffusion architectures.

Abstract

Contemporary diffusion models show remarkable capability in text-to-image generation, yet remain limited to restricted resolutions (e.g., 1,024 × 1,024). Recent advances enable tuning-free higher-resolution image generation by recycling pre-trained diffusion models and extending them via regional denoising or dilated sampling/convolutions. However, these models struggle to simultaneously preserve global semantic structure and produce creative regional details in higher-resolution images. To address this, we present C-Upscale, a new recipe for tuning-free image upscaling that pivots on global-regional priors derived from the given global prompt and from regional prompts estimated by a Multimodal LLM. Technically, the low-frequency component of the low-resolution image serves as a global structure prior that encourages global semantic consistency in high-resolution generation. Next, we perform regional attention control to screen the cross-attention between the global prompt and each region during regional denoising, yielding a regional attention prior that alleviates the object repetition issue. The estimated regional prompts, which contain rich descriptive details, further act as a regional semantic prior that fuels the creativity of regional detail generation. Both quantitative and qualitative evaluations demonstrate that C-Upscale generates ultra-high-resolution images (e.g., 4,096 × 4,096 and 8,192 × 8,192) with higher visual fidelity and more creative regional details.
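The global structure prior described above can be illustrated with a small sketch. The abstract does not say how the low-frequency component is extracted or how the alignment is applied, so everything here is an assumption made for illustration: the low-pass filter is a crude block average, the reference is assumed to be the low-resolution latent already resized to the target size, and the names `block_lowpass` and `align_low_frequency` are hypothetical.

```python
def block_lowpass(x, k):
    """Crude low-pass filter (an assumption): replace each k x k block by its mean."""
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for top in range(0, h, k):
        for left in range(0, w, k):
            rows = range(top, min(top + k, h))
            cols = range(left, min(left + k, w))
            mean = sum(x[r][c] for r in rows for c in cols) / (len(rows) * len(cols))
            for r in rows:
                for c in cols:
                    out[r][c] = mean
    return out


def align_low_frequency(current, reference, k):
    """Keep the high-frequency residual of `current`, take the low band from `reference`.

    Both inputs are 2D lists of the same shape; `reference` stands in for the
    low-resolution latent resized to the target resolution (hypothetical setup).
    """
    low_cur = block_lowpass(current, k)
    low_ref = block_lowpass(reference, k)
    return [
        [current[r][c] - low_cur[r][c] + low_ref[r][c] for c in range(len(current[0]))]
        for r in range(len(current))
    ]


if __name__ == "__main__":
    current = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
    reference = [[10.0] * 4 for _ in range(4)]
    aligned = align_low_frequency(current, reference, k=2)
    # After alignment, the block means match those of the reference.
    print(block_lowpass(aligned, 2)[0][0])
```

With this choice of filter, the aligned result has the same block means as the reference while the within-block variation of the current latent is preserved. A Fourier-domain or Gaussian low-pass would be an equally plausible reading of "low-frequency component".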

Paper Structure

This paper contains 18 sections, 7 equations, 10 figures, 6 tables, 1 algorithm.

Figures (10)

  • Figure 1: C-Upscale for high resolution image generation. SDXL generates images with resolutions up to $1,024^2$ and C-Upscale subsequently upscales images at 4$\times$, 16$\times$, and even 256$\times$. The regions in red boxes are shown in zoom-in view (right side).
  • Figure 2: Visualization of the partitioning process, which divides the latent space into smaller overlapped cropped regions. The region size specifies the dimensions of each cropped region, while the overlap size defines the extent of overlap between adjacent regions.
  • Figure 3: The overall framework of C-Upscale for high-resolution generation with global-regional priors. The image is divided into overlapping regions and each region is denoised individually. We leverage Multimodal LLM to generate prompts (Regional Semantic Prior) for each regional image which encourages creative details. Meanwhile, we extract the attention scores (Regional Attention Prior) based on the low-resolution image and global prompt for semantic alignment. These two regional priors are joined with the "Attention Composer" module. Finally, the denoised latents at each step are aligned with the low-frequency component of low-resolution latent (Global Structure Prior) to ensure structure alignment.
  • Figure 4: Attention Composer between regional semantic prior (left) and regional attention prior (right). On the left, attention computation occurs between regional latent features and regional prompts. On the right, the regional attention scores from the global textual condition are utilized. The attended features from both sides are combined as the final cross-attention features.
  • Figure 5: Two visual examples of upscaling the low-resolution images (1,024$\times$1,024 outputs of SDXL) into higher-resolution ones (4,096$\times$4,096) by different tuning-free methods. The regions in red and blue boxes are presented in zoom-in view.
  • ...and 5 more figures
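The partitioning described in the Figure 2 caption, where the latent is split into overlapping cropped regions controlled by a region size and an overlap size, can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `window_starts` and `partition` are hypothetical, and the rule of shifting the last window back so it ends flush with the border is an assumption the caption does not confirm.

```python
def window_starts(length: int, region: int, overlap: int) -> list[int]:
    """Start offsets of overlapping windows that together cover [0, length)."""
    if not 0 <= overlap < region:
        raise ValueError("overlap must satisfy 0 <= overlap < region")
    if length <= region:
        return [0]
    stride = region - overlap
    starts = list(range(0, length - region + 1, stride))
    if starts[-1] + region < length:
        # Assumed border rule: add a final window that ends exactly at the border.
        starts.append(length - region)
    return starts


def partition(height: int, width: int, region: int, overlap: int):
    """Return (top, left, bottom, right) boxes for every cropped region."""
    return [
        (top, left, min(top + region, height), min(left + region, width))
        for top in window_starts(height, region, overlap)
        for left in window_starts(width, region, overlap)
    ]


if __name__ == "__main__":
    # Illustrative sizes only; the region and overlap values are not from the paper.
    boxes = partition(height=512, width=512, region=128, overlap=64)
    print(len(boxes), boxes[0], boxes[-1])
```

Each box can then be cropped from the latent, denoised on its own, and blended back, with overlapping positions typically averaged across the regions that cover them.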