ResMaster: Mastering High-Resolution Image Generation via Structural and Fine-Grained Guidance

Shuwei Shi, Wenbo Li, Yuechen Zhang, Jingwen He, Biao Gong, Yinqiang Zheng

TL;DR

This work introduces ResMaster, a novel, training-free method that empowers resolution-limited diffusion models to generate high-quality images beyond resolution restrictions and sets a new benchmark for high-resolution image generation.

Abstract

Diffusion models excel at producing high-quality images; however, scaling to higher resolutions, such as 4K, often results in over-smoothed content, structural distortions, and repetitive patterns. To address these issues, we introduce ResMaster, a novel, training-free method that empowers resolution-limited diffusion models to generate high-quality images beyond resolution restrictions. Specifically, ResMaster leverages a low-resolution reference image created by a pre-trained diffusion model to provide structural and fine-grained guidance for crafting high-resolution images on a patch-by-patch basis. To ensure a coherent global structure, ResMaster aligns the low-frequency components of high-resolution patches with the low-resolution reference at each denoising step. For fine-grained guidance, it incorporates tailored image prompts based on the low-resolution reference and enriched textual prompts produced by a vision-language model. This approach can significantly mitigate local pattern distortions and improve detail refinement. Extensive experiments validate that ResMaster sets a new benchmark for high-resolution image generation and demonstrates promising efficiency. The project page is https://shuweis.github.io/ResMaster .

Paper Structure (25 sections, 6 equations, 9 figures, 2 tables)

This paper contains 25 sections, 6 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Comparisons of $16\times$ ($4096 \times 4096$) image generation based on SDXL [podell2023sdxl]. Our ResMaster can restore lost details and complex structures (e.g., faces and hands) from low-resolution originals while preserving structural integrity and semantic fidelity compared to ScaleCrafter [he2023scalecrafter] and DemoFusion [du2024demofusion]. We upscale the $1024 \times 1024$ image to the same resolution to facilitate comparison.
  • Figure 2: Multi-aspect-ratio images generated by ResMaster versus SDXL [podell2023sdxl]. SDXL can synthesize high-quality $1024 \times 1024$ images. ResMaster can further upscale the generated results by 16 times or more without retraining the text-to-image diffusion model. Best viewed zoomed in.
  • Figure 3: The overall framework of ResMaster. ResMaster is a patch-based denoising diffusion model that includes structural and fine-grained guidance. Fine-grained guidance utilizes an Image Condition Extractor and a Vision-Language Model to extract region-aware image features and re-caption text prompts, respectively. These conditions are then used together via Cross Attention to guide the denoising process of the current patch. Furthermore, structural guidance ensures the structure of the generated image through low-frequency component swapping.
  • Figure 4: The overall pipeline of Structural Guidance. We use 2D Fast Fourier Transform to convert images to the frequency domain and apply a Gaussian low-pass filter to extract low-frequency information for exchange. This low-frequency information is then fused with the original high-frequency information and converted back to the spatial domain.
  • Figure 5: Qualitative comparisons with other methods. All results are presented at a resolution of $4096 \times 4096$ ($16 \times$), with the SDXL results being directly upscaled from $1024 \times 1024$. Some areas have been zoomed in.
  • ...and 4 more figures
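
The low-frequency swap described in the Figure 4 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the `sigma` parameter (taken here to be in normalized frequency units), and the soft blending of the two spectra with a Gaussian mask are all assumptions made for illustration. The caption does not say whether the operation runs on pixels or latents, so the sketch simply takes two 2D arrays of the same shape.

```python
import numpy as np


def gaussian_lowpass_mask(height, width, sigma):
    """Centered Gaussian low-pass mask (hypothetical helper).

    `sigma` is an assumed parameter in normalized frequency units
    (cycles per sample); smaller values keep fewer low frequencies.
    """
    fy = np.fft.fftshift(np.fft.fftfreq(height))
    fx = np.fft.fftshift(np.fft.fftfreq(width))
    yy, xx = np.meshgrid(fy, fx, indexing="ij")
    return np.exp(-(yy**2 + xx**2) / (2.0 * sigma**2))


def swap_low_frequencies(patch, reference, sigma=0.05):
    """Take low frequencies from `reference` and high frequencies from `patch`.

    Both inputs are assumed to be real arrays of identical shape,
    e.g. a high-resolution patch and the matching reference region
    already resized to the patch size.
    """
    if patch.shape != reference.shape:
        raise ValueError("patch and reference must have the same shape")

    mask = gaussian_lowpass_mask(patch.shape[-2], patch.shape[-1], sigma)

    # 2D FFT, shifted so the zero frequency sits at the center of the mask.
    patch_freq = np.fft.fftshift(np.fft.fft2(patch), axes=(-2, -1))
    ref_freq = np.fft.fftshift(np.fft.fft2(reference), axes=(-2, -1))

    # Low band comes from the reference, the rest stays from the patch.
    fused = ref_freq * mask + patch_freq * (1.0 - mask)

    # Back to the spatial domain; the imaginary part is numerical noise.
    return np.fft.ifft2(np.fft.ifftshift(fused, axes=(-2, -1))).real
```

Because the mask equals 1 at the zero frequency, the output inherits the mean value of the reference while retaining the high-frequency texture of the patch. In this sketch a larger `sigma` transfers more of the reference's structure, at the cost of overwriting more of the patch's own detail.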