CutDiffusion: A Simple, Fast, Cheap, and Strong Diffusion Extrapolation Method

Mingbao Lin, Zhihang Lin, Wengyi Zhan, Liujuan Cao, Rongrong Ji

TL;DR

The paper addresses the high cost and complexity of generating high-resolution images from pre-trained low-resolution diffusion models. It introduces CutDiffusion, a tuning-free, two-stage diffusion extrapolation method that splits patch-based extrapolation into comprehensive structure denoising and subsequent detail refinement, employing pixel interaction and pixel relocation. The approach delivers fast, memory-efficient inference with fewer patches and a single upscale step, while achieving strong generation quality compared with both tuning-based and tuning-free baselines. This work lowers the barrier to high-resolution diffusion by enabling cheaper, faster, and more accessible high-resolution image synthesis on consumer hardware, demonstrated on SDXL with thorough ablations and comparisons.

Abstract

Transforming large pre-trained low-resolution diffusion models to cater to higher-resolution demands, i.e., diffusion extrapolation, significantly improves diffusion adaptability. We propose tuning-free CutDiffusion, which aims to simplify and accelerate the diffusion extrapolation process while making it more affordable and improving performance. CutDiffusion follows the existing patch-wise extrapolation paradigm but cuts a standard patch diffusion process into an initial phase focused on comprehensive structure denoising and a subsequent phase dedicated to specific detail refinement. Comprehensive experiments highlight four advantages of CutDiffusion: (1) simple method construction that enables a concise higher-resolution diffusion process without third-party engagement; (2) fast inference speed, achieved through a single-step higher-resolution diffusion process and fewer inference patches; (3) cheap GPU cost, resulting from patch-wise inference and fewer patches during comprehensive structure denoising; (4) strong generation performance, stemming from the emphasis on specific detail refinement.
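
To make the "fewer inference patches" claim concrete, the following is a minimal sketch of a two-phase patch schedule. It is not the authors' implementation: it only counts patch evaluations and performs no denoising, pixel interaction, or pixel relocation. The names `tile_boxes`, `two_phase_schedule`, and `switch_step`, as well as the canvas size, patch size, overlap stride, and step counts, are all hypothetical values chosen for illustration.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right)


def tile_boxes(size: int, patch: int, stride: int) -> List[Box]:
    """Return square patch boxes that cover a size x size canvas.

    stride == patch gives non-overlapping tiles; stride < patch gives
    overlapping tiles. All values here are illustrative assumptions.
    """
    if patch > size or stride <= 0:
        raise ValueError("need 0 < stride and patch <= size")
    starts = list(range(0, size - patch + 1, stride))
    if starts[-1] != size - patch:
        # Make sure the last row/column of patches reaches the border.
        starts.append(size - patch)
    return [(t, l, t + patch, l + patch) for t in starts for l in starts]


def two_phase_schedule(
    total_steps: int, switch_step: int, size: int, patch: int, overlap_stride: int
) -> List[Tuple[int, str, int]]:
    """List (step, phase, patches_to_denoise) from the noisiest step down to 1.

    Steps above switch_step use non-overlapping patches (structure phase);
    the remaining steps use overlapping patches (detail phase).
    switch_step is a hypothetical parameter marking the phase boundary.
    """
    coarse = tile_boxes(size, patch, stride=patch)
    fine = tile_boxes(size, patch, stride=overlap_stride)
    schedule = []
    for step in range(total_steps, 0, -1):
        if step > switch_step:
            schedule.append((step, "structure", len(coarse)))
        else:
            schedule.append((step, "detail", len(fine)))
    return schedule


# Assumed setting: a 256 x 256 latent canvas, 128 x 128 patches,
# an overlap stride of 64, 50 steps, and a switch after 25 steps.
schedule = two_phase_schedule(50, 25, size=256, patch=128, overlap_stride=64)
two_phase_cost = sum(n for _, _, n in schedule)
all_overlap_cost = 50 * len(tile_boxes(256, 128, 64))
print(two_phase_cost, all_overlap_cost)
```

Under these assumed numbers, the structure phase denoises 4 non-overlapping patches per step and the detail phase 9 overlapping patches per step, for 325 patch evaluations in total, compared with 450 if every step used overlapping patches. The actual savings depend on the resolution, the overlap, and where the phase switch is placed.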

Paper Structure (30 sections, 4 equations, 9 figures, 4 tables)

Figures (9)

  • Figure 1: Visualization of decoded $\mathbf{I}_t$. The $\mathbf{I}_0$ denotes final clean image (1024$\times$1024).
  • Figure 2: CutDiffusion framework where "$\mathbf{>>}$" denotes workflow. Comprehensive structure denoising assigns similar content to each non-overlapping patch. Specific detail refinement enhances details within overlapping patches. Best view with zooming in.
  • Figure 3: A comparison of higher-resolution generation. Best view with zooming in.
  • Figure 4: Visual results of "A cute corgi on the lawn." Best view with zooming in.
  • Figure 5: Higher-resolution images for different $T'$ values. Best view with zooming in.
  • ...and 4 more figures