Upsample Guidance: Scale Up Diffusion Models without Training

Juno Hwang, Yong-Hyun Park, Junghyo Jo

TL;DR

Upsample Guidance introduces a training-free method to scale diffusion models to higher resolutions by adding a single, noise-based guidance term during sampling. It relies on SNR matching across resolutions to derive an adjusted noisy predictor, enabling parallel low- and high-resolution predictions that are combined with a tunable guidance scale. The approach is model-agnostic, applying to pixel-space, latent diffusion, and video diffusion models, and includes an adaptation strategy for latent models to mitigate artifacts. Empirical results show improved fidelity and prompt alignment at high resolutions with minimal computational overhead, and ablations elucidate the role of time and power adjustments as well as guidance scale.
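The SNR-matching idea can be checked numerically. The sketch below is illustrative and not taken from the paper's code. It assumes a variance-preserving forward process $x_t=\sqrt{\alpha_t}\,x_0+\sqrt{1-\alpha_t}\,\epsilon$ with unit-variance signal, average pooling by a factor $m$ as the downsampling operator, and a signal smooth enough that pooling leaves its power unchanged. The helper names (`downsample`, `matched_alpha`, `downsampled_power`) are hypothetical. Under these assumptions, pooling divides the noise standard deviation by $m$, so the downsampled image has an SNR that is $m^2$ times higher, and the matched noise level $\alpha_\tau$ follows from equating the two SNRs.

```python
import numpy as np

def downsample(x, m):
    """Average-pool a (H, W) array over non-overlapping m x m blocks."""
    h, w = x.shape
    return x.reshape(h // m, m, w // m, m).mean(axis=(1, 3))

def matched_alpha(alpha_t, m):
    """Solve alpha_tau / (1 - alpha_tau) = m^2 * alpha_t / (1 - alpha_t)."""
    return m**2 * alpha_t / (1.0 + (m**2 - 1) * alpha_t)

def downsampled_power(alpha_t, m):
    """Variance of the pooled noisy image, assuming a unit-variance signal
    whose power is unchanged by pooling (an assumption of this sketch)."""
    return alpha_t + (1.0 - alpha_t) / m**2

rng = np.random.default_rng(0)
m, alpha_t = 2, 0.85  # alpha_t = 0.85 is the value quoted for Figure 2

# Pooling i.i.d. unit Gaussian noise over m x m blocks shrinks its std by m.
noise = rng.standard_normal((512, 512))
noise_std_lo = downsample(noise, m).std()

alpha_tau = matched_alpha(alpha_t, m)   # noise level with the same SNR
power = downsampled_power(alpha_t, m)   # used to renormalize the pooled image

print(f"pooled noise std: {noise_std_lo:.3f}")
print(f"alpha_tau: {alpha_tau:.4f}, power: {power:.4f}")
```

In this setting the pooled image looks "cleaner" than a sample at the same nominal noise level, which is why the time and power passed to the model at the trained resolution need adjusting.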

Abstract

Diffusion models have demonstrated superior performance across various generative tasks including images, videos, and audio. However, they encounter difficulties in directly generating high-resolution samples. Previously proposed solutions to this issue involve modifying the architecture, further training, or partitioning the sampling process into multiple stages. These methods share a limitation: they cannot use pre-trained models as-is and require additional work. In this paper, we introduce upsample guidance, a technique that adapts a pretrained diffusion model (e.g., $512^2$) to generate higher-resolution images (e.g., $1536^2$) by adding only a single term in the sampling process. Remarkably, this technique requires neither additional training nor external models. We demonstrate that upsample guidance can be applied to various models, such as pixel-space, latent-space, and video diffusion models. We also observe that a proper choice of guidance scale can improve image quality, fidelity, and prompt alignment.

Paper Structure (23 sections, 10 equations, 15 figures)

This paper contains 23 sections, 10 equations, 15 figures.

Figures (15)

  • Figure 1: High-resolution samples with upsample guidance. The original trained resolution is increased ($\geq 2$ times) through upsample guidance. (a) Images sampled at twice the resolution for the models trained on CIFAR-10 and CelebA-HQ datasets at $32^2$ and $256^2$ resolutions, respectively. The adjacent image pairs are sampled from the same initial noise. (b) High-resolution images of latent diffusion models using upsample guidance. (c) Upsampled snapshots of text-to-video models. The upper panel represents spatial upsampling, while the lower panel represents temporal upsampling.
  • Figure 2: Consistency between different resolutions. (a) Downsampled image generated by the diffusion model at the target resolution. (b) Image generated at the trained resolution. The noise reduction due to downsampling creates a significant difference in the recognizability between the central and right images at the trained resolution, indicating a change in their signal-to-noise ratio. For this example, $\alpha_t=0.85$ is used.
  • Figure 3: Conceptual illustration of upsample guidance. The model receives the same noised images at two different resolutions in parallel, but time and power are adjusted at the trained resolution. The difference between the two predicted noises then acts as guidance, which is added to the total noise.
  • Figure 4: Artifacts of encoder-decoder in a LDM. When an image is upsampled or downsampled in the latent space of an LDM and then decoded back into pixel space, artifacts are introduced. The variational autoencoder introduces nonlinearity in the implementation of upsample guidance, and significant degradation can be observed in both cases.
  • Figure 5: Upsampling across various image generation models, resolutions, and conditional generation methods. Unconditional image generation, such as CIFAR-10 and CelebA-HQ, was sampled in the pixel space. For the text-to-image models, the left side of the images represents results without UG, while the right side shows results with UG. We used DreamShaper [dreamshaper2023] as an example of fine-tuned LDM. The paired images are all generated from the same initial noise. Across different models, resolutions, prompts, and conditioning, consistently better images were obtained with UG. Notably, our method effectively resolved artifacts where multiple subjects were generated or bad anatomy was present.
  • ...and 10 more figures
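
The procedure described in the Figure 3 caption can be sketched as follows. This is a minimal reading of the caption, not the authors' implementation. It assumes average pooling for downsampling, nearest-neighbour repetition for upsampling, and a model indexed directly by the noise level $\alpha$ instead of a timestep (a real model would need the matched $\alpha_\tau$ mapped back to a timestep through its noise schedule). The $1/m$ rescaling of the low-resolution prediction is also an assumption of this sketch: it follows from pooled unit noise having standard deviation $1/m$. `toy_model` is a hypothetical stand-in for a trained noise predictor, and `guided_noise` is an illustrative name.

```python
import numpy as np

def downsample(x, m):
    """Average-pool a (H, W) array over non-overlapping m x m blocks."""
    h, w = x.shape
    return x.reshape(h // m, m, w // m, m).mean(axis=(1, 3))

def upsample(x, m):
    """Nearest-neighbour upsampling: repeat each pixel m times per axis."""
    return np.repeat(np.repeat(x, m, axis=0), m, axis=1)

def guided_noise(model, x_t, alpha_t, m, w):
    """Combine high- and low-resolution noise predictions (sketch)."""
    # Prediction at the target (high) resolution, unmodified.
    eps_hi = model(x_t, alpha_t)

    # Adjust power and noise level for the pooled input (assumed forms).
    power = alpha_t + (1.0 - alpha_t) / m**2
    alpha_tau = m**2 * alpha_t / (1.0 + (m**2 - 1) * alpha_t)
    eps_lo = model(downsample(x_t, m) / np.sqrt(power), alpha_tau)

    # Guidance: difference of the two predictions at low resolution,
    # upsampled and added with scale w.
    guidance = upsample(eps_lo / m - downsample(eps_hi, m), m)
    return eps_hi + w * guidance

def toy_model(x, alpha):
    """Hypothetical noise predictor; only here to make the sketch runnable."""
    return x / np.sqrt(1.0 - alpha)

rng = np.random.default_rng(0)
x_t = rng.standard_normal((64, 64))

out_unguided = guided_noise(toy_model, x_t, alpha_t=0.85, m=2, w=0.0)
out_guided = guided_noise(toy_model, x_t, alpha_t=0.85, m=2, w=0.3)
print(out_guided.shape)
```

With `w=0.0` the function returns the plain high-resolution prediction, so the guidance scale interpolates between ordinary sampling and stronger reliance on the trained-resolution prediction. The value `w=0.3` is arbitrary and chosen only for the demonstration.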