Patch Diffusion: Faster and More Data-Efficient Training of Diffusion Models

Zhendong Wang; Yifan Jiang; Huangjie Zheng; Peihao Wang; Pengcheng He; Zhangyang Wang; Weizhu Chen; Mingyuan Zhou

TL;DR

Patch Diffusion introduces a patch-wise, coordinate-conditioned score-matching framework that reduces diffusion-model training time and data requirements. The model is trained on randomly cropped patches, with patch location and size supplied as conditions, and a multi-scale patch-size schedule preserves global coherence. Sampling works as in a standard diffusion model. Empirical results show at least 2× faster training and strong performance in small-data regimes, along with finetuning and extrapolation capabilities. The approach is plug-and-play and agnostic to backbone and sampler, and it points to future gains via advanced positional embeddings and theoretical convergence analysis.

Abstract

Diffusion models are powerful, but they require a lot of time and data to train. We propose Patch Diffusion, a generic patch-wise training framework that significantly reduces training time while improving data efficiency, which helps democratize diffusion model training for a broader range of users. At the core of our innovations is a new conditional score function at the patch level, where the patch location in the original image is included as additional coordinate channels, while the patch size is randomized and diversified throughout training to encode cross-region dependency at multiple scales. Sampling with our method is as easy as in the original diffusion model. Through Patch Diffusion, we achieve $\mathbf{\ge 2\times}$ faster training while maintaining comparable or better generation quality. Patch Diffusion also improves the performance of diffusion models trained on relatively small datasets, e.g., as few as 5,000 images to train from scratch. We achieve outstanding FID scores in line with state-of-the-art benchmarks: 1.77 on CelebA-64$\times$64, 1.93 on AFHQv2-Wild-64$\times$64, and 2.72 on ImageNet-256$\times$256. We share our code and pre-trained models at https://github.com/Zhendong-Wang/Patch-Diffusion.
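To make the patch-level conditioning concrete, the following is a minimal NumPy sketch of how a training input could be assembled: a square patch is cropped at a random location, with a size drawn at random from a set of candidates, and two coordinate channels recording each pixel's position in the original image are appended. It is an illustration under stated assumptions, not the authors' implementation: the function names `coordinate_channels` and `sample_patch`, the candidate sizes `(16, 32, 64)`, the uniform choice over sizes, and the normalization of coordinates to [-1, 1] are all chosen here for the example.

```python
import numpy as np


def coordinate_channels(height, width):
    """Return two channels holding each pixel's (x, y) position.

    Assumption for this sketch: coordinates are normalized to [-1, 1]
    over the full image, with the top-left pixel at (-1, -1).
    """
    ys = np.linspace(-1.0, 1.0, height, dtype=np.float32)
    xs = np.linspace(-1.0, 1.0, width, dtype=np.float32)
    grid_y, grid_x = np.meshgrid(ys, xs, indexing="ij")
    return np.stack([grid_x, grid_y], axis=0)  # shape (2, H, W)


def sample_patch(image, patch_sizes, rng):
    """Crop a random square patch and append its coordinate channels.

    image: array of shape (C, H, W).
    patch_sizes: hypothetical candidate sizes, drawn uniformly here.
    Returns the (C + 2, s, s) patch and the (top, left, size) of the crop.
    """
    _, height, width = image.shape
    size = int(rng.choice(patch_sizes))
    top = int(rng.integers(0, height - size + 1))   # upper bound is exclusive
    left = int(rng.integers(0, width - size + 1))

    # Coordinates are computed on the full image, then cropped with the
    # same window, so the patch keeps its location in the original image.
    coords = coordinate_channels(height, width)
    pixels = image[:, top:top + size, left:left + size]
    location = coords[:, top:top + size, left:left + size]
    return np.concatenate([pixels, location], axis=0), (top, left, size)


rng = np.random.default_rng(0)
image = rng.standard_normal((3, 64, 64)).astype(np.float32)  # stand-in image
patch, (top, left, size) = sample_patch(image, patch_sizes=(16, 32, 64), rng=rng)
print(patch.shape, top, left, size)
```

Because the coordinate grid is built once for the full image and cropped with the same window as the pixels, a full-size patch reduces to the whole image plus the complete grid, which is consistent with the abstract's statement that sampling is as easy as in the original diffusion model.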

Paper Structure (26 sections, 10 equations, 7 figures, 2 tables)

This paper contains 26 sections, 10 equations, 7 figures, 2 tables.

Introduction
Related Work
Preliminaries for diffusion models.
Patch Diffusion Training
Patch-wise Score Matching
Progressive and Stochastic Patch Size Scheduling
Conditional Coordinates for Patch Location
Sampling
Experiments
Datasets.
Evaluation protocol.
Implementations.
Ablation study
Impact of full-size images.
Patch size scheduling.
...and 11 more sections

Figures (7)

Figure 1: Illustration of Patch Diffusion on training and sampling.
Figure 2: FID results on CelebA-64$\times$64 with different $p$ values.
Figure 4: Randomly generated images from Patch Diffusion (EDM-DDPM++ backbone) trained on CelebA-64$\times$64 and FFHQ-64$\times$64, and Latent Patch Diffusion (EDM-ADM backbone) trained on ImageNet-256$\times$256.
Figure 7: Extrapolation Results. Patch Diffusion could generate beyond the boundary by extrapolating the learned coordinate manifold. For each pair of images, the left panel is the reference image in resolution 256 $\times$ 256 and it is fixed in the center during the reverse process of Patch Diffusion, while the right panel shows the generated sample in resolution 384 $\times$ 384, where the out-of-boundary region is regenerated. Note our model is trained only on 256 $\times$ 256 images.
Figure 8: Randomly generated images from Patch Diffusion (EDM-DDPM++ backbone) trained on LSUN-Bedroom/Church-256$\times$256.
...and 2 more figures
