FouriScale: A Frequency Perspective on Training-Free High-Resolution Image Synthesis

Linjiang Huang, Rongyao Fang, Aiping Zhang, Guanglu Song, Si Liu, Yu Liu, Hongsheng Li

TL;DR

FouriScale tackles the challenge of generating ultra-high-resolution images from diffusion models trained at fixed resolutions by introducing a training-free, frequency-domain approach. It uses a dilation-based convolution combined with a low-pass filter to achieve structural and scale consistency across resolutions, complemented by a padding-then-cropping strategy for arbitrary aspect ratios and a guiding mechanism to improve fidelity. The method demonstrates strong quantitative and qualitative gains over baseline training-free approaches while remaining fast and broadly compatible with existing pre-trained models. Limitations include artifacts at extreme resolutions and applicability mainly to convolution-based diffusion models, suggesting avenues for extending the approach to transformer-based architectures.

Abstract

In this study, we delve into the generation of high-resolution images from pre-trained diffusion models, addressing persistent challenges, such as repetitive patterns and structural distortions, that emerge when models are applied beyond their trained resolutions. To address these challenges, we introduce an innovative, training-free approach, FouriScale, from the perspective of frequency domain analysis. We replace the original convolutional layers in pre-trained diffusion models by incorporating a dilation technique along with a low-pass operation, intending to achieve structural consistency and scale consistency across resolutions, respectively. Further enhanced by a padding-then-crop strategy, our method can flexibly handle text-to-image generation of various aspect ratios. By using FouriScale as guidance, our method balances the structural integrity and fidelity of generated images, enabling arbitrary-size, high-resolution, and high-quality generation. With its simplicity and compatibility, our method can provide valuable insights for future explorations into the synthesis of ultra-high-resolution images. The code will be released at https://github.com/LeonHLJ/FouriScale.
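The structural-consistency idea behind the dilation technique can be illustrated with a small one-dimensional sketch. It is not the authors' implementation: the helper `conv1d_valid`, the kernel values, and the toy signal are all assumptions made for illustration. It only checks the standard identity that a convolution dilated by a factor `scale` on a high-resolution signal, sampled every `scale` positions, matches the undilated convolution on the down-sampled signal.

```python
def conv1d_valid(x, w, dilation=1):
    """Correlate x with kernel w at the given dilation ('valid' positions only).

    Hypothetical helper for illustration; not taken from the FouriScale code.
    """
    span = dilation * (len(w) - 1)
    return [
        sum(w[k] * x[n + dilation * k] for k in range(len(w)))
        for n in range(len(x) - span)
    ]


scale = 2                                   # assumed resolution ratio
kernel = [0.25, 0.5, 0.25]                  # stands in for a pre-trained kernel
high_res = [float((3 * i) % 7) for i in range(16)]   # toy high-resolution signal
low_res = high_res[::scale]                 # the same signal down-sampled

# Dilated convolution at high resolution, then keep every `scale`-th output.
dilated_out = conv1d_valid(high_res, kernel, dilation=scale)
subsampled = dilated_out[::scale]

# Plain convolution at the low (training-like) resolution.
plain_out = conv1d_valid(low_res, kernel, dilation=1)

print(subsampled)
print(plain_out)
```

The two printed lists agree because each dilated output at an even position reads exactly the samples that the plain kernel reads on the down-sampled signal. The sketch does not address the aliasing that down-sampling introduces, which is the role the abstract assigns to the low-pass operation.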

Paper Structure (34 sections, 3 theorems, 27 equations, 15 figures, 4 tables, 1 algorithm)

Key Result

Theorem

Spatial down-sampling reduces the range of frequencies that the signal can accommodate, particularly at the higher end of the spectrum. This process causes high frequencies to be folded onto low frequencies and superposed on the original low frequencies. For a one-dimensional signal, where $\mathbb{S}$ denotes the superposing operator, $\Omega_x$ is the sampling rate along the $x$ axis, a
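The folding effect can be checked numerically. The sketch below relies on the standard discrete Fourier transform identity for strided sampling, not on the paper's exact formula (which is truncated above). The function names `dft` and `folded_spectrum` and the toy signal are assumptions for illustration. With stride $s$ and a length-$N$ signal, the spectrum of the down-sampled signal equals the average of $s$ copies of the original spectrum shifted by multiples of $N/s$.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform (O(n^2)); adequate for a short demo."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def folded_spectrum(spectrum, stride):
    """Average `stride` shifted copies of the spectrum (high bins fold onto low bins)."""
    m = len(spectrum) // stride
    return [
        sum(spectrum[k + a * m] for a in range(stride)) / stride
        for k in range(m)
    ]


stride = 2                                           # assumed down-sampling stride
signal = [float((5 * t) % 11) for t in range(16)]    # toy signal, length divisible by stride

direct = dft(signal[::stride])                 # spectrum of the down-sampled signal
folded = folded_spectrum(dft(signal), stride)  # folded spectrum of the original

max_error = max(abs(d - f) for d, f in zip(direct, folded))
print(max_error)  # expected to be at floating-point round-off level
```

Because the high-frequency half of the original spectrum is added onto the low-frequency half, removing those high frequencies before down-sampling is what keeps the low-frequency content unchanged.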

Figures (15)

  • Figure 1: Visualization of the pattern repetition issue in higher-resolution image synthesis using pre-trained SDXL [podell2023sdxl] (Train: 1024$\times$1024; Inference: 2048$\times$2048). Attn-Entro [jin2023training] fails to address this problem, and ScaleCrafter [he2023scalecrafter] still struggles with it in image details. Our method handles this problem and generates high-quality images without model retraining.
  • Figure 1: Visualization of the design of a low-pass filter. (a) 1D filter for the positive axis. (b) 2D low-pass filter, which is constructed by mirroring the 1D filters and performing an outer product between two 1D filters, in accordance with the settings of the 1D filter.
  • Figure 2: Overview of FouriScale (orange line), which includes a dilation convolution operation and a low-pass filtering operation to achieve structural consistency and scale consistency across resolutions, respectively.
  • Figure 2: Ablation studies of FouriScale components on the SD 2.1 model under the $16\times$, 1:1 aspect-ratio setting.
  • Figure 2: Reference block names of Stable Diffusion used in the following experiment details.
  • ...and 10 more figures
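The low-pass filter construction named in the caption above (mirror a 1D filter defined on the positive axis, then take an outer product of two such filters) can be sketched as follows. The specific 1D shape used here (a flat pass band followed by a linear decay to `floor`), the parameter names, and the centered frequency layout are assumptions for illustration. The caption does not specify them, and the paper's actual settings may differ.

```python
def one_sided_filter(half_size, cutoff, floor=0.0):
    """1D response on the non-negative frequency axis.

    Assumed shape: 1.0 for bins below `cutoff`, then a linear decay that
    reaches `floor` at the last bin. Hypothetical; not taken from the paper.
    """
    values = []
    for i in range(half_size):
        if i < cutoff:
            values.append(1.0)
        else:
            frac = (i - cutoff + 1) / (half_size - cutoff)
            values.append(1.0 - frac * (1.0 - floor))
    return values


def mirror(one_sided):
    """Reflect the positive-axis filter to cover negative frequencies too."""
    return one_sided[::-1] + one_sided


def low_pass_2d(height, width, cutoff_h, cutoff_w, floor=0.0):
    """Outer product of two mirrored 1D filters (even height and width assumed).

    The lowest frequencies sit at the center of the mask, as in a shifted spectrum.
    """
    rows = mirror(one_sided_filter(height // 2, cutoff_h, floor))
    cols = mirror(one_sided_filter(width // 2, cutoff_w, floor))
    return [[r * c for c in cols] for r in rows]


mask = low_pass_2d(8, 8, cutoff_h=2, cutoff_w=2)
for row in mask:
    print(row)
```

In use, such a mask would multiply a centered 2D spectrum element-wise. Using separate cutoffs per axis lets the pass band follow a non-square scaling factor.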

Theorems & Definitions (3)

  • Theorem
  • Lemma
  • Lemma