Patch Diffusion: Faster and More Data-Efficient Training of Diffusion Models

Zhendong Wang, Yifan Jiang, Huangjie Zheng, Peihao Wang, Pengcheng He, Zhangyang Wang, Weizhu Chen, Mingyuan Zhou

TL;DR

Patch Diffusion introduces a patch-wise, coordinate-conditioned score-matching framework that reduces the training time and data requirements of diffusion models. By training on randomly cropped patches, conditioned on each patch's location and size, and by scheduling patch sizes across multiple scales, the method preserves global coherence while leaving sampling unchanged. Empirically, it trains at least 2× faster, performs strongly in small-data regimes, and supports finetuning and extrapolation. The approach is plug-and-play and agnostic to both backbone and sampler, and it identifies advanced positional embeddings and a theoretical convergence analysis as avenues for further gains.
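
The patch-plus-coordinates conditioning described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a square single-channel image stored as a list of rows, and it assumes that pixel index k of a full image of side R maps to -1 + 2k/(R-1), so the whole image spans [-1, 1] on both axes. The helper names `coord_channels` and `random_patch_with_coords` are hypothetical.

```python
import random


def coord_channels(full_size, top, left, patch_size):
    """Return (x_coords, y_coords) for a patch_size x patch_size crop whose
    top-left corner is at (top, left) in a full_size x full_size image.

    Assumed mapping: full-image pixel index k -> -1 + 2*k/(full_size - 1),
    so coordinates are always relative to the FULL image, not the patch.
    """
    scale = 2.0 / (full_size - 1)
    # x varies along columns, constant down each column
    xs = [[-1.0 + scale * (left + c) for c in range(patch_size)]
          for _ in range(patch_size)]
    # y varies along rows, constant across each row
    ys = [[-1.0 + scale * (top + r) for _ in range(patch_size)]
          for r in range(patch_size)]
    return xs, ys


def random_patch_with_coords(image, patch_size, rng=random):
    """Crop a random square patch from a square single-channel image and
    return [pixels, x-coords, y-coords] plus the (top, left) offset."""
    full_size = len(image)
    top = rng.randrange(full_size - patch_size + 1)
    left = rng.randrange(full_size - patch_size + 1)
    patch = [row[left:left + patch_size] for row in image[top:top + patch_size]]
    xs, ys = coord_channels(full_size, top, left, patch_size)
    # The coordinate channels are stacked next to the pixel channel
    return [patch, xs, ys], (top, left)


# Usage: an 8x8 toy image, one random 4x4 training patch
img = [[r * 8 + c for c in range(8)] for r in range(8)]
channels, (top, left) = random_patch_with_coords(img, 4, random.Random(0))
```

Because the coordinates are defined relative to the full image, `coord_channels(R, 0, 0, R)` returns the full-image grid. Under these assumptions, a sampler only needs to supply that fixed grid alongside the noisy image, which is consistent with the claim that sampling works as in a standard diffusion model.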

Abstract

Diffusion models are powerful, but they require a lot of time and data to train. We propose Patch Diffusion, a generic patch-wise training framework, to significantly reduce the training time costs while improving data efficiency, which thus helps democratize diffusion model training to broader users. At the core of our innovations is a new conditional score function at the patch level, where the patch location in the original image is included as additional coordinate channels, while the patch size is randomized and diversified throughout training to encode the cross-region dependency at multiple scales. Sampling with our method is as easy as in the original diffusion model. Through Patch Diffusion, we could achieve $\mathbf{\ge 2\times}$ faster training, while maintaining comparable or better generation quality. Patch Diffusion meanwhile improves the performance of diffusion models trained on relatively small datasets, e.g., as few as 5,000 images to train from scratch. We achieve outstanding FID scores in line with state-of-the-art benchmarks: 1.77 on CelebA-64$\times$64, 1.93 on AFHQv2-Wild-64$\times$64, and 2.72 on ImageNet-256$\times$256. We share our code and pre-trained models at https://github.com/Zhendong-Wang/Patch-Diffusion.
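
The randomized, multi-scale patch size can be illustrated with a stochastic schedule. This sketch assumes, hypothetically, that a training step uses the full resolution with probability `p` and otherwise picks half or quarter resolution with equal probability; `sample_patch_size` draws a size and `expected_pixel_fraction` computes how many pixels an average step processes relative to full-size training. The split and both function names are illustrative assumptions, and because `p` appears here only in the Figure 2 caption, its exact meaning in this sketch is also an assumption.

```python
import random


def sample_patch_size(resolution, p, rng=random):
    """Hypothetical stochastic schedule: full resolution with probability p,
    otherwise resolution//2 or resolution//4 with equal probability."""
    if rng.random() < p:
        return resolution
    return rng.choice([resolution // 2, resolution // 4])


def expected_pixel_fraction(p):
    """Expected pixels per step relative to full-size training under the
    same hypothetical split; pixel count scales with side length squared."""
    smaller = (1 - p) / 2  # probability mass of each smaller patch size
    return p * 1.0 + smaller * (1 / 2) ** 2 + smaller * (1 / 4) ** 2


# Usage: draw patch sizes for a 64x64 dataset
rng = random.Random(0)
sizes = [sample_patch_size(64, 0.5, rng) for _ in range(8)]
fraction = expected_pixel_fraction(0.5)  # 0.5 + 0.0625 + 0.015625
```

With `p = 0.5` the expected fraction is 0.578125, roughly 58% of the pixels of full-size training. Pixel count is only a rough proxy for compute (batch size, memory, and backbone overheads also matter), so this arithmetic is not a derivation of the reported speedup.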

Paper Structure (26 sections, 10 equations, 7 figures, 2 tables)

Figures (7)

  • Figure 1: Illustration of Patch Diffusion on training and sampling.
  • Figure 2: FID results on CelebA-64$\times$64 with different $p$ values.
  • Figure 4: Randomly generated images from Patch Diffusion (EDM-DDPM++ backbone) trained on CelebA-64$\times$64 and FFHQ-64$\times$64, and Latent Patch Diffusion (EDM-ADM backbone) trained on ImageNet-256$\times$256.
  • Figure 7: Extrapolation Results. Patch Diffusion can generate beyond the image boundary by extrapolating the learned coordinate manifold. For each pair of images, the left panel is the reference image at resolution 256 $\times$ 256, which is kept fixed in the center during the reverse process of Patch Diffusion, while the right panel shows the generated sample at resolution 384 $\times$ 384, where the out-of-boundary region is regenerated. Note that our model is trained only on 256 $\times$ 256 images.
  • Figure 8: Randomly generated images from Patch Diffusion (EDM-DDPM++ backbone) trained on LSUN-Bedroom/Church-256$\times$256.
  • ...and 2 more figures
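
The extrapolation in Figure 7 can be made concrete by computing coordinates on the larger canvas. Assuming the same hypothetical mapping as during training (pixel k of a 256-pixel axis maps to -1 + 2k/255) and a reference image centered on a 384-pixel canvas, the 64 extra pixels on each side receive coordinates outside [-1, 1], reaching about ±1.5 at the canvas edges. `extrapolated_axis` is a hypothetical helper: it only computes the coordinate grid for one axis and does not perform the reverse diffusion process.

```python
def extrapolated_axis(canvas_size, train_size):
    """Coordinates along one axis of a canvas_size-pixel canvas whose central
    train_size pixels reuse the assumed training grid k -> -1 + 2*k/(train_size-1).
    Pixels outside the central region fall beyond [-1, 1]."""
    offset = (canvas_size - train_size) // 2  # left/top margin of the reference
    scale = 2.0 / (train_size - 1)
    return [-1.0 + scale * (j - offset) for j in range(canvas_size)]


# Usage: a 384-pixel canvas around a 256-pixel reference
axis = extrapolated_axis(384, 256)
edges = (axis[0], axis[64], axis[319], axis[383])  # outer edge, reference edges, outer edge
```

The same axis serves for both x and y on a square canvas, so the full 384 $\times$ 384 coordinate grid is the outer product of `axis` with itself.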