ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with Diffusion Models

Yingqing He, Shaoshu Yang, Haoxin Chen, Xiaodong Cun, Menghan Xia, Yong Zhang, Xintao Wang, Ran He, Qifeng Chen, Ying Shan

TL;DR

ScaleCrafter addresses the challenge of generating ultra-high-resolution images from pre-trained diffusion models by diagnosing object repetition as a consequence of a limited convolutional receptive field. It introduces a tuning-free pipeline comprising dynamic re-dilation, convolution dispersion, and noise-damped classifier-free guidance to enlarge the perceptual field during inference without retraining. The method achieves up to 4096×4096 (16× training resolution) and supports arbitrary aspect ratios, while outperforming training-free baselines and approaching diffusion-super-resolution in texture fidelity. It also demonstrates applicability to text-to-video diffusion, suggesting a broadly applicable strategy for ultra-high-resolution synthesis using pre-trained priors.

Abstract

In this work, we investigate the capability of generating images from pre-trained diffusion models at much higher resolutions than the training image sizes. In addition, the generated images should support arbitrary aspect ratios. When generating images directly at a higher resolution, 1024 x 1024, with Stable Diffusion pre-trained on images of resolution 512 x 512, we observe persistent problems of object repetition and unreasonable object structures. Existing works for higher-resolution generation, such as attention-based and joint-diffusion approaches, do not adequately address these issues. As a new perspective, we examine the structural components of the U-Net in diffusion models and identify the crucial cause as the limited perception field of convolutional kernels. Based on this key observation, we propose a simple yet effective re-dilation that can dynamically adjust the convolutional perception field during inference. We further propose the dispersed convolution and noise-damped classifier-free guidance, which can enable ultra-high-resolution image generation (e.g., 4096 x 4096). Notably, our approach does not require any training or optimization. Extensive experiments demonstrate that our approach can address the repetition issue well and achieve state-of-the-art performance on higher-resolution image synthesis, especially in texture details. Our work also suggests that a diffusion model pre-trained on low-resolution images can be directly used for high-resolution visual generation without further tuning, which may provide insights for future research on ultra-high-resolution image and video synthesis.
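
The re-dilation idea in the abstract (keep the pre-trained weights fixed and widen the kernel's perception field at inference) can be illustrated with a minimal sketch. It assumes plain Python lists in place of real convolution weights, and the helper names `dilate_kernel` and `effective_size` are hypothetical, not taken from the paper. Deep learning frameworks typically expose dilation as a convolution parameter rather than materializing the zeros as done here.

```python
def effective_size(k, d):
    """Spatial extent covered by a k-tap kernel with dilation factor d."""
    return d * (k - 1) + 1


def dilate_kernel(kernel, d):
    """Insert d - 1 zeros between neighbouring taps of a 2D kernel.

    The original weights are copied unchanged; only their spacing grows,
    so the kernel covers a larger area without any retraining.
    """
    rows, cols = len(kernel), len(kernel[0])
    out = [[0.0] * effective_size(cols, d) for _ in range(effective_size(rows, d))]
    for i in range(rows):
        for j in range(cols):
            out[i * d][j * d] = kernel[i][j]
    return out


# Illustrative 3x3 kernel (values are arbitrary, chosen for the demo only).
kernel = [
    [1.0, 2.0, 1.0],
    [0.0, 0.0, 0.0],
    [-1.0, -2.0, -1.0],
]

# Doubling the dilation turns the 3x3 footprint into a 5x5 footprint.
redilated = dilate_kernel(kernel, 2)
for row in redilated:
    print(row)
```

With dilation factor 2 the same nine weights span a 5×5 window, which is the sense in which the perception field is enlarged without touching the parameters.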

Paper Structure (24 sections, 5 equations, 14 figures, 13 tables)

Figures (14)

  • Figure 1: Structure repetition in higher-resolution generation (Train: 512$^2$; Inference: 512$\times$1024 and 1024$^2$). Neither altering the attention scaling factor [trainfree-variablesize] nor joint-diffusion approaches such as MultiDiffusion [multidiffusion] and SyncDiffusion [syncdiffusion] address this problem. In contrast, our simple re-dilation solves it and yields structurally and semantically correct images, while requiring no optimization or tuning cost.
  • Figure 2: Our method can generate $4096 \times 4096$ images, 16$\times$ higher than the training resolution.
  • Figure 3: (a) The first row shows re-dilation. Given a kernel pre-trained on low-resolution data, we fix its parameters and insert spaces between kernel elements at test time. The second row shows fractional dilated convolution. For each entry of the convolution kernel, we compute the input feature by bilinearly interpolating the features near the kernel entry center. This is equivalent to stretching the input feature maps and using a rounded-up dilation scale before the convolution operation. (b) Dispersed convolution can enlarge a pre-trained kernel by a specific scale. We use structure-level calibration to adapt to a new perception field when the input feature dimension is larger, and pixel-level calibration to preserve the ability to process local information.
  • Figure 4: Left: Samples obtained by increasing the perception field in the middle blocks only and in most blocks (middle and outskirt blocks). The middle-blocks-only setting fails to produce correct small-object structures. Right: The first row shows the predicted original sample using noise-damped classifier-free guidance. The second and third rows show the predictions using $\tilde{\boldsymbol{\epsilon}}_\theta(\boldsymbol{x}_t, y)$ and $\tilde{\boldsymbol{\epsilon}}_\theta(\boldsymbol{x}_t)$. Both $\tilde{\boldsymbol{\epsilon}}_\theta(\boldsymbol{x}_t, y)$ and $\tilde{\boldsymbol{\epsilon}}_\theta(\boldsymbol{x}_t)$ fail to remove noise during sampling, but their predictions exhibit a very similar noise pattern. The fourth row illustrates $\vert \tilde{\boldsymbol{\epsilon}}_\theta(\boldsymbol{x}_t, y) - \tilde{\boldsymbol{\epsilon}}_\theta(\boldsymbol{x}_t)\vert$: the erroneous noise prediction vanishes in the difference, so the remaining useful information can be used.
  • Figure 5: Visual comparisons between ① ours, ② direct inference with SD, and ③ Attn-SF [trainfree-variablesize] in the 4$\times$, 8$\times$, and 16$\times$ settings across three Stable Diffusion models.
  • ...and 9 more figures
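
The Figure 4 caption observes that the conditional and unconditional predictions of the modified network share a similar erroneous noise pattern, which vanishes in their difference. A minimal sketch of how such a difference could be used follows. It assumes, beyond what the caption states, that the guidance direction from the modified network is added to an unconditional prediction from the unmodified network; the function name `noise_damped_cfg`, its signature, and all numeric values are hypothetical and serve only as illustration.

```python
def noise_damped_cfg(eps_base_uncond, eps_mod_cond, eps_mod_uncond, w):
    """Combine predictions element-wise: base + w * (modified_cond - modified_uncond).

    eps_base_uncond: unconditional prediction of the unmodified network (assumed input).
    eps_mod_cond / eps_mod_uncond: predictions of the perception-field-enlarged network.
    w: guidance scale.
    """
    return [
        b + w * (c - u)
        for b, c, u in zip(eps_base_uncond, eps_mod_cond, eps_mod_uncond)
    ]


# Toy values: both modified predictions carry the same additive error.
base_uncond = [0.1, -0.2, 0.3]
clean_cond = [0.2, -0.1, 0.5]
clean_uncond = [0.1, -0.2, 0.3]
shared_error = [5.0, -4.0, 3.0]

noisy_cond = [c + e for c, e in zip(clean_cond, shared_error)]
noisy_uncond = [u + e for u, e in zip(clean_uncond, shared_error)]

with_error = noise_damped_cfg(base_uncond, noisy_cond, noisy_uncond, w=7.5)
without_error = noise_damped_cfg(base_uncond, clean_cond, clean_uncond, w=7.5)

# The shared error cancels in the difference, so both results agree
# up to floating-point rounding.
max_gap = max(abs(a - b) for a, b in zip(with_error, without_error))
print(max_gap < 1e-9)
```

In this toy setting the shared error contributes nothing to the output, which mirrors the caption's remark that the erroneous noise prediction vanishes in the difference.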