FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion
Haonan Qiu, Shiwei Zhang, Yujie Wei, Ruihang Chu, Hangjie Yuan, Xiang Wang, Yingya Zhang, Ziwei Liu
TL;DR
FreeScale tackles the bottleneck of high-resolution image and video generation with pre-trained diffusion models by introducing a tuning-free inference framework that fuses multi-scale information and controls frequency content. It combines Tailored Self-Cascade Upscaling, Restrained Dilated Convolution, and Scale Fusion to preserve global structure while enriching local detail, integrated into self-attention modules for minimal overhead. Across SDXL and VideoCrafter2, FreeScale delivers state-of-the-art or competitive quality at 8k image resolution and 640x1024 video resolution, with strong quantitative metrics and qualitative improvements over prior tuning-free methods. The work highlights practical implications for high-resolution diffusion generation and outlines avenues for scaling and further refinements with base-model-dependent constraints.
Abstract
Visual diffusion models have achieved remarkable progress, yet they are typically trained at limited resolutions due to the lack of high-resolution data and constrained computational resources, which hampers their ability to generate high-fidelity images or videos at higher resolutions. Recent efforts have explored tuning-free strategies to unlock the untapped potential of pre-trained models for higher-resolution visual generation. However, these methods are still prone to producing low-quality visual content with repetitive patterns. The key obstacle lies in the inevitable increase in high-frequency information when the model generates visual content exceeding its training resolution, which leads to undesirable repetitive patterns arising from accumulated errors. To tackle this challenge, we propose FreeScale, a tuning-free inference paradigm that enables higher-resolution visual generation via scale fusion. Specifically, FreeScale processes information from different receptive scales and then fuses it by extracting the desired frequency components. Extensive experiments validate the superiority of our paradigm in extending the capabilities of higher-resolution visual generation for both image and video models. Notably, compared with previous best-performing methods, FreeScale unlocks 8k-resolution text-to-image generation for the first time.
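To make the idea of fusing information "by extracting desired frequency components" more concrete, here is a minimal sketch of a frequency split on 1-D signals. It is an illustration only, not the FreeScale implementation. The function names (`gaussian_kernel`, `low_pass`, `fuse_frequencies`), the Gaussian low-pass filter, the edge clamping, and the choice of which input supplies which band are all assumptions, since the abstract does not specify them.

```python
import math


def gaussian_kernel(radius, sigma):
    """Return a normalized 1-D Gaussian kernel of length 2 * radius + 1."""
    weights = [
        math.exp(-(i * i) / (2.0 * sigma * sigma))
        for i in range(-radius, radius + 1)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def low_pass(signal, radius=2, sigma=1.0):
    """Blur a 1-D signal with a Gaussian kernel, clamping indices at the edges."""
    kernel = gaussian_kernel(radius, sigma)
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            # Clamp the neighbour index so the filter stays inside the signal.
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * signal[j]
        out.append(acc)
    return out


def fuse_frequencies(low_source, high_source, radius=2, sigma=1.0):
    """Combine the low band of `low_source` with the high band of `high_source`.

    The high band is taken as the residual of the signal after blurring.
    Which branch plays which role in FreeScale is NOT stated in the abstract;
    the argument names here are hypothetical.
    """
    if len(low_source) != len(high_source):
        raise ValueError("signals must have the same length")
    low = low_pass(low_source, radius, sigma)
    high_blur = low_pass(high_source, radius, sigma)
    return [l + (h - hb) for l, h, hb in zip(low, high_source, high_blur)]
```

Under these assumptions, fusing a signal with itself returns the original signal (up to floating-point error), because the low band and the high-band residual sum back to the input. A real feature map would need a 2-D or 3-D filter applied per channel, which this sketch omits.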
