FreCaS: Efficient Higher-Resolution Image Generation via Frequency-aware Cascaded Sampling

Zhengqiang Zhang, Ruihuang Li, Lei Zhang

TL;DR

FreCaS tackles the challenge of generating high-resolution images with pretrained diffusion models without retraining. It introduces frequency-aware cascaded sampling, which expands resolution and frequency bands across multiple stages, guided by FA-CFG to emphasize newly introduced frequencies and by cross-attention (CA) map reuse to preserve layout. Empirical results across SD2.1, SDXL, and SD3 show that FreCaS achieves superior image quality (lower FID_b and FID_p; competitive CLIP score) and significant latency reductions (roughly 2–6x faster than strong baselines). The approach enables scalable, training-free high-resolution diffusion-based synthesis and extends to more advanced models such as SD3.

Abstract

While image generation with diffusion models has achieved great success, generating images of higher resolution than the training size remains a challenging task due to the high computational cost. Current methods typically perform the entire sampling process at full resolution and process all frequency components simultaneously, which contradicts the inherent coarse-to-fine nature of latent diffusion models and wastes computation on premature high-frequency details at early diffusion stages. To address this issue, we introduce an efficient $\textbf{Fre}$quency-aware $\textbf{Ca}$scaded $\textbf{S}$ampling framework, $\textbf{FreCaS}$ for short, for higher-resolution image generation. FreCaS decomposes the sampling process into cascaded stages with gradually increasing resolutions, progressively expanding frequency bands and refining the corresponding details. We propose a frequency-aware classifier-free guidance (FA-CFG) strategy that assigns different guidance strengths to different frequency components, directing the diffusion model to add new details in the expanded frequency domain of each stage. Additionally, we fuse the cross-attention maps of the previous and current stages to avoid synthesizing unfaithful layouts. Experiments demonstrate that FreCaS significantly outperforms state-of-the-art methods in image quality and generation speed. In particular, FreCaS is about 2.86$\times$ and 6.07$\times$ faster than ScaleCrafter and DemoFusion, respectively, in generating a 2048$\times$2048 image using a pre-trained SDXL model, and achieves FID$_b$ improvements of 11.6 and 3.7, respectively. FreCaS can be easily extended to more complex models such as SD3. The source code of FreCaS can be found at https://github.com/xtudbxk/FreCaS.
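
To make the FA-CFG idea in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that the low-frequency part of a denoising score can be approximated by average pooling followed by nearest-neighbour upsampling, and that the high-frequency part is the residual; the function names `low_pass` and `fa_cfg` and the `factor` parameter are hypothetical. The weights `w_l` and `w_h` follow the symbols used in the paper's ablation figure.

```python
import numpy as np

def low_pass(x, factor=2):
    """Crude low-pass filter (an assumption, not the paper's filter):
    average-pool a (C, H, W) array by `factor`, then upsample back by repetition.
    H and W must be divisible by `factor`."""
    c, h, w = x.shape
    pooled = x.reshape(c, h // factor, factor, w // factor, factor).mean(axis=(2, 4))
    return pooled.repeat(factor, axis=1).repeat(factor, axis=2)

def fa_cfg(eps_uncond, eps_cond, w_l, w_h, factor=2):
    """Frequency-aware CFG sketch: apply guidance strength w_l to the
    low-frequency part and w_h to the high-frequency part, then recombine."""
    low_u = low_pass(eps_uncond, factor)
    low_c = low_pass(eps_cond, factor)
    high_u = eps_uncond - low_u  # high-frequency residual (unconditional)
    high_c = eps_cond - low_c    # high-frequency residual (conditional)
    low = low_u + w_l * (low_c - low_u)
    high = high_u + w_h * (high_c - high_u)
    return low + high

# Toy usage with random stand-ins for the two denoising scores.
rng = np.random.default_rng(0)
eps_u = rng.standard_normal((4, 8, 8))
eps_c = rng.standard_normal((4, 8, 8))
eps_hat = fa_cfg(eps_u, eps_c, w_l=7.5, w_h=15.0)
```

Because the split is linear, setting `w_l == w_h` reduces this sketch to standard classifier-free guidance, which is a convenient sanity check; choosing `w_h > w_l` strengthens guidance only on the high-frequency residual.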

Paper Structure

This paper contains 31 sections, 7 equations, 14 figures, 10 tables.

Figures (14)

  • Figure 1: From (a) to (d), the sub-figures show the PSD curves of latents ${\bm{z}}_\text{900}$, ${\bm{z}}_\text{600}$, ${\bm{z}}_\text{300}$ and ${\bm{z}}_\text{0}$ of SDXL, respectively. One can see that the energy of synthesized clean signals (the red slashed regions) first emerges in the low-frequency band and gradually expands to the high-frequency band.
  • Figure 2: (a) The overall framework of FreCaS. The entire $T$-step sampling process is divided into $N+1$ stages of increasing resolutions and expanding frequency bands. FreCaS starts the sampling process at the training size and obtains the last latent ${\bm{z}}^{s_\text{0}}_{L}$ at that stage. Then, FreCaS continues the sampling from the first latent ${\bm{z}}^{s_\text{1}}_F$ at the next stage with a larger resolution and expanded frequency domain. This procedure is repeated until the final latent ${\bm{z}}^{s_N}_\text{0}$ at stage $N$ is obtained. A decoder is then used to generate the final image. (b) FA-CFG strategy. We separate the original denoising scores into low-frequency and high-frequency components and assign a higher CFG strength to the high-frequency part. The two parts are then combined to obtain the final denoising score $\hat{{\bm{\epsilon}}}$.
  • Figure 3: Visual comparison on $\times 4$ and $\times 16$ experiments of SD2.1 and SDXL. From top to bottom, the prompts used in the four groups of examples are: 1. "A cosmic traveler, floating in zero gravity, spacesuit reflecting the Earth below, stars twinkling in the distance." 2. "A fierce Viking, axe in hand, leading a raid, the longship slicing through the waves." 3. "A bustling flower market, stalls filled with bouquets, the air thick with fragrance, people selecting their favorites." 4. "Tokyo Japan Retro Skyline, Airplane, Railroad Train, Moon etc. - Modern Postcard". Zoom in for a better view.
  • Figure 4: Ablation studies on $w_l$ and $w_h$ in FA-CFG strategy and $w_c$ in CA-maps reutilization.
  • Figure 5: Visual results of adjusting $w_h$ in the FA-CFG strategy. From top to bottom, the prompts are "Eccentric Shaggy Woman with Pet - Little Puppy" and "Rabat Painting - Mdina Poppies Malta by Richard Harpum", respectively.
  • ...and 9 more figures
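
The PSD curves mentioned in the Figure 1 caption can be illustrated with a radially averaged power spectral density. The NumPy sketch below is one common way to compute such a curve for a single 2D channel; it is an assumption about the general technique, not the authors' plotting code, and the function name `radial_psd` is hypothetical.

```python
import numpy as np

def radial_psd(x):
    """Radially averaged power spectral density of a 2D array.
    Index k of the result is the mean power at integer radius k
    from the zero-frequency (DC) component."""
    h, w = x.shape
    spectrum = np.fft.fftshift(np.fft.fft2(x))  # move DC to (h//2, w//2)
    power = np.abs(spectrum) ** 2
    yy, xx = np.indices((h, w))
    radius = np.hypot(yy - h // 2, xx - w // 2).astype(int)
    total = np.bincount(radius.ravel(), weights=power.ravel())
    count = np.bincount(radius.ravel())
    return total / np.maximum(count, 1)  # guard against empty bins

# Toy usage: a horizontal cosine with 4 cycles across a 16x16 grid.
cols = np.arange(16)
pattern = np.tile(np.cos(2 * np.pi * 4 * cols / 16), (16, 1))
curve = radial_psd(pattern)
```

For this toy pattern the curve peaks at radius 4, matching the 4 cycles of the cosine. Applying the same function to latents at different timesteps is one way to observe energy moving from low to high frequencies, as the caption describes.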