FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion

Haonan Qiu, Shiwei Zhang, Yujie Wei, Ruihang Chu, Hangjie Yuan, Xiang Wang, Yingya Zhang, Ziwei Liu

TL;DR

FreeScale addresses the difficulty of generating high-resolution images and videos with pre-trained diffusion models by introducing a tuning-free inference framework that fuses multi-scale information and controls frequency content. It combines Tailored Self-Cascade Upscaling, Restrained Dilated Convolution, and Scale Fusion to preserve global structure while enriching local detail, with Scale Fusion integrated into the self-attention layers for minimal overhead. On SDXL and VideoCrafter2, FreeScale delivers state-of-the-art or competitive quality at 8k image resolution and 640x1024 video resolution, with quantitative and qualitative improvements over prior tuning-free methods. The work also notes that results remain constrained by the base model, leaving room for further scaling and refinement.

Abstract

Visual diffusion models have achieved remarkable progress, yet they are typically trained at limited resolutions due to the lack of high-resolution data and constrained computation resources, which hampers their ability to generate high-fidelity images or videos at higher resolutions. Recent efforts have explored tuning-free strategies to unlock the untapped potential of pre-trained models for higher-resolution visual generation. However, these methods are still prone to producing low-quality visual content with repetitive patterns. The key obstacle lies in the inevitable increase in high-frequency information when the model generates visual content exceeding its training resolution, leading to undesirable repetitive patterns that derive from accumulated errors. To tackle this challenge, we propose FreeScale, a tuning-free inference paradigm that enables higher-resolution visual generation via scale fusion. Specifically, FreeScale processes information from different receptive scales and then fuses it by extracting the desired frequency components. Extensive experiments validate the superiority of our paradigm in extending the capabilities of higher-resolution visual generation for both image and video models. Notably, compared with previous best-performing methods, FreeScale unlocks 8k-resolution text-to-image generation for the first time.

Paper Structure

This paper contains 23 sections, 7 equations, 13 figures, 8 tables.

Figures (13)

  • Figure 1: Gallery of FreeScale. The original SDXL can only generate images with a resolution of up to $1024^2$ without losing quality, while FreeScale successfully extends SDXL to generate $8192^2$ images without any fine-tuning. All generated images are produced using a single A800 GPU. Best viewed ZOOMED-IN.
  • Figure 2: Overall framework of FreeScale. (a) Tailored Self-Cascade Upscaling. FreeScale starts with pure Gaussian noise and progressively denoises it using the training resolution. An image is then generated via the VAE decoder, followed by upscaling to obtain a higher-resolution one. We gradually add noise to the latent of this higher-resolution image and incorporate this forward noise into the denoising process of the higher-resolution latent with the use of restrained dilated convolution. Additionally, for intermediate latent steps, we enhance high-frequency details by applying region-aware detail control using masks derived from the image. (b) Scale Fusion. During denoising, we adapt the self-attention layer to a global and local attention structure. By utilizing Gaussian blur, we fuse high-frequency details from global attention and low-frequency semantics from local attention, serving as the final output of the self-attention layer.
  • Figure 3: Image qualitative comparisons with other baselines. Our method generates both $2048^2$ and $4096^2$ vivid images with better content coherence and local details. Best viewed ZOOMED-IN.
  • Figure 4: Results of flexible control for detail level. A better result is generated by increasing the coefficient weight in the area of the Griffons and reducing the coefficient weight in the other regions. Best viewed ZOOMED-IN.
  • Figure 5: Results of local semantic editing. FreeScale makes the hair purple or edits the face to make this person look more Japanese at the higher resolution ($4096^2$).
  • ...and 8 more figures
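
The Scale Fusion step described in the Figure 2 caption (high-frequency details taken from global attention, low-frequency semantics taken from local attention, separated with a Gaussian blur) can be illustrated with a small sketch. This is not the authors' implementation: it assumes the fusion takes the form `h_global - G(h_global) + G(h_local)`, where `G` is a Gaussian low-pass filter, and it works on 1-D lists rather than attention feature maps to stay dependency-free. The function names, kernel radius, and sigma are all hypothetical choices made for illustration.

```python
import math


def gaussian_kernel(radius, sigma):
    """Normalized 1-D Gaussian kernel of length 2 * radius + 1."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def gaussian_blur(signal, radius=2, sigma=1.0):
    """Low-pass filter a 1-D signal, replicating edge values at the borders."""
    kernel = gaussian_kernel(radius, sigma)
    n = len(signal)
    blurred = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)  # clamp index into range
            acc += w * signal[j]
        blurred.append(acc)
    return blurred


def scale_fusion(h_global, h_local, radius=2, sigma=1.0):
    """Assumed fusion rule: high-pass(h_global) + low-pass(h_local)."""
    low_global = gaussian_blur(h_global, radius, sigma)
    low_local = gaussian_blur(h_local, radius, sigma)
    # h_global minus its blurred copy is the high-frequency residual;
    # the blurred h_local contributes the low-frequency content.
    return [g - lg + ll for g, lg, ll in zip(h_global, low_global, low_local)]


if __name__ == "__main__":
    h_global = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]  # fine, oscillating detail
    h_local = [2.0] * 8                                   # smooth content
    print(scale_fusion(h_global, h_local))
```

Two sanity checks follow from the assumed rule: when both inputs are identical the fused output equals the input, and when both inputs are constant the output equals the local branch, since a constant signal has no high-frequency residual.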