Creatively Upscaling Images with Global-Regional Priors
Yurui Qian, Qi Cai, Yingwei Pan, Ting Yao, Tao Mei
TL;DR
C-Upscale is a tuning-free, diffusion-based upscaling framework that leverages three global-regional priors to produce high-fidelity, creative ultra-high-resolution images. The Global Structure Prior guides global semantic alignment via the low-frequency content of the low-resolution latent, while the Regional Attention Prior and Regional Semantic Prior tailor region-wise consistency and detail generation using cropped cross-attention and prompts produced by Multimodal LLMs. An Attention Composer fuses regional cues to produce coherent, regionally rich outputs, enabling upscaling to 4,096 × 4,096, with results also shown at 8,192 × 8,192. Empirical results on synthetic and real-world inputs, comprehensive ablations, and human studies demonstrate superior performance over state-of-the-art tuning-free methods and SR baselines, with favorable efficiency and generalization across diffusion architectures.
Abstract
Contemporary diffusion models show remarkable capability in text-to-image generation, yet remain limited to restricted resolutions (e.g., 1,024 × 1,024). Recent advances enable tuning-free higher-resolution image generation by recycling pre-trained diffusion models and extending them via regional denoising or dilated sampling/convolutions. However, these models struggle to simultaneously preserve global semantic structure and produce creative regional details in higher-resolution images. To address this, we present C-Upscale, a new recipe for tuning-free image upscaling that pivots on global-regional priors derived from the given global prompt and from regional prompts estimated by a Multimodal LLM. Technically, the low-frequency component of the low-resolution image is taken as the global structure prior to encourage global semantic consistency in high-resolution generation. Next, we perform regional attention control to screen the cross-attention between the global prompt and each region during regional denoising, yielding a regional attention prior that alleviates the object repetition issue. The estimated regional prompts, which contain rich descriptive details, further act as a regional semantic prior to fuel the creativity of regional detail generation. Both quantitative and qualitative evaluations demonstrate that C-Upscale generates ultra-high-resolution images (e.g., 4,096 × 4,096 and 8,192 × 8,192) with higher visual fidelity and more creative regional details.
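To make the notion of a "low-frequency component" concrete, the following is a minimal sketch of one way to isolate it from an image or latent array, using an ideal low-pass filter in the Fourier domain. It is not taken from the paper: the function name `low_frequency_component`, the `cutoff` parameter, the circular mask, and the (channels, height, width) layout are all illustrative assumptions, and the filter C-Upscale actually uses may differ.

```python
import numpy as np

def low_frequency_component(latent, cutoff=0.25):
    """Return the low-frequency part of a (C, H, W) array.

    Hypothetical helper for illustration only. `cutoff` is a radius in
    cycles per sample (the Nyquist limit along each axis is 0.5).
    """
    _, h, w = latent.shape
    # Frequency coordinates for each spatial axis, in cycles per sample.
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    # Ideal (hard) circular low-pass mask; a Gaussian mask is a common alternative.
    mask = (np.sqrt(fx ** 2 + fy ** 2) <= cutoff).astype(latent.dtype)
    # Filter every channel independently over the two spatial axes.
    spectrum = np.fft.fft2(latent, axes=(-2, -1))
    return np.fft.ifft2(spectrum * mask, axes=(-2, -1)).real

# Usage: a constant signal plus a checkerboard (the highest spatial frequency).
yy, xx = np.indices((8, 8))
checker = ((-1.0) ** (yy + xx))[None]      # shape (1, 8, 8)
latent = np.full((4, 8, 8), 3.0) + checker  # broadcasts to (4, 8, 8)
low = low_frequency_component(latent)       # checkerboard is removed, constant kept
```

In this sketch the coarse layout (here, the constant offset) survives the filter while fine texture (the checkerboard) is discarded, which matches the intuition of keeping the global structure fixed and leaving fine detail free for the generator to synthesize.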
