Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation
Lanqing Guo, Yingqing He, Haoxin Chen, Menghan Xia, Xiaodong Cun, Yufei Wang, Siyu Huang, Yong Zhang, Xintao Wang, Qifeng Chen, Ying Shan, Bihan Wen
TL;DR
This work tackles the cost of adapting diffusion models to higher resolutions with a self-cascade framework that reuses a frozen low-resolution model and guides higher-resolution generation with pivot-based semantic cues. It pairs a tuning-free baseline (pivot replacement) with a tunable variant that plugs in time-aware upsampler modules, which need only about 0.002M trainable parameters to learn high-frequency details. Compared with full fine-tuning, the approach achieves a 5× training speed-up, adapts in roughly 10k tuning steps, and adds virtually no inference overhead, while synthesizing images and videos at up to 16× higher resolution. Experiments on LAION-5B and WebVid-10M cover both the tuning-free and the tuning settings, offering a scalable path to fast high-resolution diffusion synthesis.
Abstract
Diffusion models have proven to be highly effective in image and video generation; however, because they are trained on single-scale data, they struggle to compose objects correctly when generating images of varying sizes. Adapting large pre-trained diffusion models to higher resolution demands substantial computational and optimization resources, yet achieving generation capabilities comparable to low-resolution models remains challenging. This paper proposes a novel self-cascade diffusion model that leverages the knowledge gained from a well-trained low-resolution image/video generation model, enabling rapid adaptation to higher-resolution generation. Building on this, we employ a pivot replacement strategy to facilitate a tuning-free version by progressively leveraging reliable semantic guidance derived from the low-resolution model. We further propose to integrate a sequence of learnable multi-scale upsampler modules for a tuning version capable of efficiently learning structural details at a new scale from a small amount of newly acquired high-resolution training data. Compared to full fine-tuning, our approach achieves a $5\times$ training speed-up and requires only 0.002M tuning parameters. Extensive experiments demonstrate that our approach can quickly adapt to higher-resolution image and video synthesis by fine-tuning for just $10$k steps, with virtually no additional inference time.
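To make the idea of reusing low-resolution output as semantic guidance more concrete, here is a minimal NumPy sketch of one plausible reading. The abstract does not spell out the mechanism, so everything here is an assumption: the sketch upsamples a low-resolution sample and diffuses it to an intermediate timestep with the standard DDPM forward process, giving a noisy "pivot" from which higher-resolution denoising could start. The function names (`linear_alpha_bar`, `upsample_nearest`, `make_pivot`), the linear beta schedule, the nearest-neighbour upsampling, and the timestep `k=600` are all hypothetical illustration choices, not the authors' implementation.

```python
import numpy as np

def linear_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def upsample_nearest(z, scale):
    """Nearest-neighbour upsampling of an (H, W, C) array by an integer factor."""
    return z.repeat(scale, axis=0).repeat(scale, axis=1)

def make_pivot(z_low, scale, k, alpha_bar, rng):
    """Upsample a low-resolution sample and diffuse it to timestep k.

    Applies the standard DDPM forward process:
        z_k = sqrt(alpha_bar[k]) * z_0 + sqrt(1 - alpha_bar[k]) * eps
    """
    z_up = upsample_nearest(z_low, scale)
    eps = rng.standard_normal(z_up.shape)
    return np.sqrt(alpha_bar[k]) * z_up + np.sqrt(1.0 - alpha_bar[k]) * eps

rng = np.random.default_rng(0)
alpha_bar = linear_alpha_bar()
z_low = rng.standard_normal((8, 8, 4))  # stand-in for a low-resolution latent
pivot = make_pivot(z_low, scale=2, k=600, alpha_bar=alpha_bar, rng=rng)
print(pivot.shape)  # (16, 16, 4)
```

In this reading, a higher-resolution denoiser would start from `pivot` at step `k` rather than from pure noise, so the coarse layout comes from the low-resolution sample while the remaining steps fill in detail. The denoiser itself and the learnable upsampler modules are omitted because the abstract gives no specifics about them.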
