LowDiff: Efficient Diffusion Sampling with Low-Resolution Condition

Jiuyi Xu, Qing Jin, Meida Chen, Andrew Feng, Yang Sui, Yangming Shi

Abstract

Diffusion models have achieved remarkable success in image generation, but their practical application is often hindered by slow sampling. Prior efforts to improve efficiency primarily focus on compressing models or reducing the total number of denoising steps, largely neglecting the possibility of leveraging multiple input resolutions during generation. In this work, we propose LowDiff, a novel and efficient diffusion framework based on a cascaded approach that generates outputs of increasingly higher resolution. LowDiff employs a unified model to progressively refine images from low resolution to the desired resolution. With the proposed architecture design and generation techniques, we achieve comparable or even superior performance with far fewer high-resolution sampling steps. LowDiff is applicable to diffusion models in both pixel space and latent space. Extensive experiments on both conditional and unconditional generation tasks across CIFAR-10, FFHQ and ImageNet demonstrate the effectiveness and generality of our method. Results show over 50% throughput improvement across all datasets and settings while maintaining comparable or better quality. On unconditional CIFAR-10, LowDiff achieves an FID of 2.11 and an IS of 9.87; on conditional CIFAR-10, it achieves an FID of 1.94 and an IS of 10.03. On FFHQ 64x64, LowDiff achieves an FID of 2.43, and on ImageNet 256x256, LowDiff built on LightningDiT-B/1 produces high-quality samples with an FID of 4.00 and an IS of 195.06, together with substantial efficiency gains.

Paper Structure

This paper contains 23 sections, 7 figures, 12 tables, 2 algorithms.

Figures (7)

  • Figure 1: Sample quality vs. effective number of function evaluations on the CIFAR-10 dataset. Our model provides output quality comparable to the previous best results and reduces the required computation by more than 60%.
  • Figure 2: Model Architecture. Our model supports progressive training at arbitrary resolution levels. During training, the lowest-resolution images are trained in the standard, unconditional way, while higher-resolution images are trained conditioned on the upsampled noisy images of the corresponding lower resolution (injected with independent Gaussian noise).
  • Figure 3: Inference Pipeline. Generation proceeds in a bottom-up manner: starting from Gaussian noise, the model first generates the lowest-resolution image with truncation. This noisy image is then upsampled to condition generation at the next higher resolution, whose output in turn conditions the following level, until the highest-resolution image is generated.
  • Figure 4: Dual usage of low-resolution blocks in our architecture. (a) Low-resolution image generation. The low-resolution blocks are directly used to generate low-resolution images unconditionally. (b) Sub-network reuse for high-resolution generation. The same low-resolution blocks are reused as subblocks within the high-resolution pathway. Instead of operating as an independent module, they are integrated into the high-res network to follow a UNet-style hierarchical design.
  • Figure 5: Generated images for CIFAR-10. (a) Unconditional. (b) Conditional.
  • ...and 2 more figures
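
The bottom-up inference pipeline described in the Figure 3 caption can be sketched roughly as follows. This is only an illustrative toy, not the authors' implementation: `toy_denoise_step` is a hypothetical stand-in for one call of the unified network, nearest-neighbour upsampling is an assumed choice, and the `(resolution, total_steps, stop_at)` schedule, where `stop_at > 0` models truncation, is invented for the example.

```python
import numpy as np


def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of an (H, W, C) array (assumed upsampler)."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)


def toy_denoise_step(x, cond, t):
    """Hypothetical placeholder for one denoising call of the unified model.

    It only nudges x toward the condition (or toward zero when there is
    no condition); a real model would predict noise or a clean image.
    """
    target = np.zeros_like(x) if cond is None else cond
    return x + (target - x) / (t + 2)


def sample_cascade(levels, denoise_step=toy_denoise_step, channels=3, seed=0):
    """Generate one sample per resolution level, lowest resolution first.

    levels: list of (resolution, total_steps, stop_at) tuples (hypothetical
    schedule). Steps run from t = total_steps - 1 down to t = stop_at, so
    stop_at > 0 truncates sampling and leaves the output partly noisy.
    """
    rng = np.random.default_rng(seed)
    cond, x, prev_res = None, None, None
    outputs = []
    for res, total_steps, stop_at in levels:
        if x is not None:
            # The still-noisy lower-resolution output is upsampled and
            # used as the condition for the current level.
            cond = upsample_nearest(x, res // prev_res)
        # Every level starts from fresh Gaussian noise at its own size.
        x = rng.standard_normal((res, res, channels))
        for t in range(total_steps - 1, stop_at - 1, -1):
            x = denoise_step(x, cond, t)
        outputs.append(x)
        prev_res = res
    return outputs


# Two truncated low-resolution levels, then a full run at the target size.
samples = sample_cascade([(8, 4, 2), (16, 4, 2), (32, 4, 0)])
print([s.shape for s in samples])
```

In this toy schedule only the final level runs all of its steps; the lower levels stop early, which is the sense in which fewer sampling steps are spent per level before moving up in resolution.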