Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation

Lanqing Guo, Yingqing He, Haoxin Chen, Menghan Xia, Xiaodong Cun, Yufei Wang, Siyu Huang, Yong Zhang, Xintao Wang, Qifeng Chen, Ying Shan, Bihan Wen

TL;DR

This work tackles the high cost of adapting diffusion models to higher resolutions by introducing a self-cascade framework that reuses a frozen low-resolution model and guides higher-resolution generation with pivot-based semantic cues. It pairs a tuning-free baseline (pivot replacement) with a tunable variant that plugs in time-aware upsampler modules, which need only about 0.002M trainable parameters to learn high-frequency details. Compared to full fine-tuning, the approach achieves over $5\times$ training speed-up and adapts in roughly $10k$ tuning steps with virtually no inference overhead, delivering strong image and video synthesis quality at resolutions up to $16\times$ higher. Empirical results on Laion-5B and Webvid-10M demonstrate robust performance in both the tuning-free and tuning settings, offering a scalable path to fast high-resolution diffusion synthesis.

Abstract

Diffusion models have proven to be highly effective in image and video generation; however, they encounter challenges in the correct composition of objects when generating images of varying sizes due to single-scale training data. Adapting large pre-trained diffusion models to higher resolution demands substantial computational and optimization resources, yet achieving generation capabilities comparable to low-resolution models remains challenging. This paper proposes a novel self-cascade diffusion model that leverages the knowledge gained from a well-trained low-resolution image/video generation model, enabling rapid adaptation to higher-resolution generation. Building on this, we employ the pivot replacement strategy to facilitate a tuning-free version by progressively leveraging reliable semantic guidance derived from the low-resolution model. We further propose to integrate a sequence of learnable multi-scale upsampler modules for a tuning version capable of efficiently learning structural details at a new scale from a small amount of newly acquired high-resolution training data. Compared to full fine-tuning, our approach achieves a $5\times$ training speed-up and requires only 0.002M tuning parameters. Extensive experiments demonstrate that our approach can quickly adapt to higher-resolution image and video synthesis by fine-tuning for just $10k$ steps, with virtually no additional inference time.
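
The cascade described in the abstract can be illustrated with a toy sketch. The code below is one plausible reading of the tuning-free idea, not the authors' implementation: a clean low-resolution result is upsampled, re-noised to an intermediate level, and denoised again by the same frozen model. The abstract does not specify the upsampling operator, the noise level, or the denoiser interface, so `upsample_nearest`, `alpha_bar_k`, and `denoise_fn` are all hypothetical names chosen for illustration; only the forward-noising formula is the standard DDPM one.

```python
import numpy as np

def upsample_nearest(z, scale):
    """Nearest-neighbour upsampling of an (H, W) array by an integer factor.
    The choice of operator is an assumption made for this sketch."""
    return np.repeat(np.repeat(z, scale, axis=0), scale, axis=1)

def forward_noise(z0, alpha_bar_k, rng):
    """Standard DDPM forward noising: z_K = sqrt(a) * z0 + sqrt(1 - a) * eps."""
    eps = rng.standard_normal(z0.shape)
    return np.sqrt(alpha_bar_k) * z0 + np.sqrt(1.0 - alpha_bar_k) * eps

def self_cascade(denoise_fn, base_shape, scales, alpha_bar_k, rng):
    """Reuse one frozen denoiser across stages of increasing resolution.

    denoise_fn(z_noisy, from_pure_noise) -> clean array is a hypothetical
    interface standing in for a full reverse-diffusion run.
    """
    # Stage 0: ordinary generation at the base resolution, from pure noise.
    z0 = denoise_fn(rng.standard_normal(base_shape), True)
    for s in scales:
        # Upsample the previous clean result and re-noise it, so the next
        # stage starts from a semantically guided point instead of pure noise.
        pivot = forward_noise(upsample_nearest(z0, s), alpha_bar_k, rng)
        z0 = denoise_fn(pivot, False)
    return z0

# Usage with a stand-in denoiser (clipping only, no learned model).
rng = np.random.default_rng(0)
toy_denoiser = lambda z, from_pure_noise: np.clip(z, -1.0, 1.0)
result = self_cascade(toy_denoiser, (4, 4), [2, 2], 0.5, rng)
print(result.shape)
```

Starting each stage from a re-noised pivot rather than from pure noise is what would let the low-resolution composition carry over; the tuning version described above additionally plugs learnable upsampler modules into the model, which this sketch does not cover.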

Paper Structure (13 sections, 9 equations, 8 figures, 3 tables, 2 algorithms)

Figures (8)

  • Figure 1: Average FVD$\downarrow$ scores of full fine-tuning (Full-FT) and our proposed fast adaptation method (Ours), evaluated every $5k$ iterations on the Webvid-10M benchmark. We observe that full fine-tuning necessitates a large number of training steps and suffers from poor composition ability and desaturation issues. In contrast, our method enables rapid adaptation to the higher-resolution domain while preserving reliable semantic and local structure generation capabilities.
  • Figure 2: Illustration of the proposed self-cascade diffusion model, which is implemented in both tuning-free and tuning versions. (a) For the tuning-free version, we cyclically re-utilize the low-resolution model to progressively adapt it to the higher-resolution generation; (b) For the tuning version, we additionally plug feature upsamplers ($\Phi$) into the base low-resolution generation model: the denoising process of image $z^r_t$ in step $t$ will be guided by the pivot guidance $z^{r-1}_0$ from the pivot stage (last stage) with a series of plugged-in tuneable upsampler modules.
  • Figure 3: Visual examples of Ours-T (Tuning) adapting to various higher resolutions, e.g., $1024^2$, $3072\times 1536$, $1536\times 3072$, and $2048^2$, from the pre-trained SD 2.1 trained on $512^2$ images, compared with the $1024^2$ results of full fine-tuning (Full-FT) and LoRA-R4 (bottom-right corner: red dashed box). Please zoom in for more details.
  • Figure 4: Visual quality comparisons between full fine-tuning ($50k$) and Ours-T ($10k$) on higher-resolution video synthesis of $16\times 512^2$.
  • Figure 5: Average FID and FVD scores of three methods every $5k$ iterations on image (Laion-5B) and video (Webvid-10M) datasets. Our observations indicate that our method can rapidly adapt to the higher-resolution domain while maintaining robust performance across both image and video generation.
  • ...and 3 more figures