Upsample Guidance: Scale Up Diffusion Models without Training
Juno Hwang, Yong-Hyun Park, Junghyo Jo
TL;DR
Upsample Guidance is a training-free method for scaling diffusion models to higher resolutions by adding a single guidance term to the noise prediction during sampling. It relies on SNR matching across resolutions to derive an adjusted noise predictor, so that low- and high-resolution predictions can be computed in parallel and combined with a tunable guidance scale. The approach is model-agnostic: it applies to pixel-space, latent, and video diffusion models, and includes an adaptation strategy for latent models to mitigate artifacts. Empirical results show improved fidelity and prompt alignment at high resolutions with minimal computational overhead, and ablations clarify the roles of the time adjustment, the power adjustment, and the guidance scale.
Abstract
Diffusion models have demonstrated superior performance across various generative tasks, including images, videos, and audio. However, they encounter difficulties in directly generating high-resolution samples. Previously proposed solutions involve modifying the architecture, further training, or partitioning the sampling process into multiple stages. These methods cannot use pretrained models as-is and therefore require additional work. In this paper, we introduce upsample guidance, a technique that adapts a pretrained diffusion model (e.g., $512^2$) to generate higher-resolution images (e.g., $1536^2$) by adding only a single term to the sampling process. Remarkably, this technique requires neither additional training nor external models. We demonstrate that upsample guidance can be applied to various models, such as pixel-space, latent-space, and video diffusion models. We also observe that a proper choice of guidance scale can improve image quality, fidelity, and prompt alignment.
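To make the "single term" concrete, the following NumPy sketch shows one way such a guidance term could be assembled. It is an illustration under stated assumptions, not the exact formulation of the paper: it assumes a variance-preserving forward process $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ with i.i.d. Gaussian noise, average pooling by an integer factor $m$ as the downsampling operator, and nearest-neighbour upsampling. Under these assumptions, pooling reduces the noise variance by $m^2$, so the downsampled input is rescaled by its power $P = \bar\alpha_t + (1-\bar\alpha_t)/m^2$ and evaluated at a timestep $\tau$ with $\mathrm{SNR}(\tau) = m^2\,\mathrm{SNR}(t)$. All function names (`downsample`, `upsample`, `snr_matched_timestep`, `upsample_guided_eps`) and the `eps_model(x, t)` interface are hypothetical.

```python
import numpy as np

def downsample(x, m):
    """Average-pool an (H, W) array by an integer factor m."""
    h, w = x.shape
    return x.reshape(h // m, m, w // m, m).mean(axis=(1, 3))

def upsample(x, m):
    """Nearest-neighbour upsample an (H, W) array by an integer factor m."""
    return np.repeat(np.repeat(x, m, axis=0), m, axis=1)

def snr_matched_timestep(alpha_bar, t, m):
    """Index tau whose SNR is closest (in log scale) to m**2 * SNR(t)."""
    snr = alpha_bar / (1.0 - alpha_bar)
    target = m**2 * snr[t]
    return int(np.argmin(np.abs(np.log(snr) - np.log(target))))

def upsample_guided_eps(eps_model, x_t, t, alpha_bar, m, w):
    """Combine high- and low-resolution noise predictions (illustrative only).

    eps_model(x, t) is a hypothetical noise predictor; w is the guidance scale.
    """
    # High-resolution prediction on the full noisy sample.
    eps_hi = eps_model(x_t, t)

    # Power adjustment (assumed form): rescale the pooled sample so that its
    # signal and noise coefficients again satisfy a variance-preserving form.
    a = alpha_bar[t]
    power = a + (1.0 - a) / m**2
    x_lo = downsample(x_t, m) / np.sqrt(power)

    # Time adjustment: evaluate the model at the SNR-matched timestep.
    tau = snr_matched_timestep(alpha_bar, t, m)
    eps_lo = eps_model(x_lo, tau)

    # Pooled unit-variance noise has standard deviation 1/m, hence the 1/m
    # factor before comparing with the pooled high-resolution prediction.
    guidance = upsample(eps_lo / m - downsample(eps_hi, m), m)
    return eps_hi + w * guidance
```

With `w = 0` the function returns the unmodified high-resolution prediction, so the guidance scale interpolates between ordinary sampling and sampling steered by the low-resolution prediction. The two model calls are independent of each other and could be batched or run in parallel.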
