Decentralized Diffusion Models
David McAllister, Matthew Tancik, Jiaming Song, Angjoo Kanazawa
TL;DR
The paper addresses the high infrastructure and interconnect costs of centralized diffusion model training by introducing Decentralized Diffusion Models (DDMs), which partition the data into $K$ clusters and train an independent expert denoiser on each cluster. A separately trained router ensembles the expert outputs at inference, forming $u_t(x_t)=\sum_{k=1}^K r_\theta(k \mid x_t,t)\,v^{k}_{\theta,t}(x_t)$, where $r_\theta(k \mid x_t,t)$ is the router's weight for expert $k$ and $v^{k}_{\theta,t}$ is that expert's prediction. The ensemble matches the global diffusion objective while avoiding cross-cluster gradient synchronization. The core contributions are the Decentralized Flow Matching (DFM) objective, independent router training, and distillation of the sparse ensemble into a dense model. Experiments on ImageNet and LAION-Aesthetics show FLOP-for-FLOP improvements over standard diffusion models, and the approach scales to eight experts and 24B total parameters. It enables training on readily available hardware, improves resilience to localized GPU failures, and opens avenues for privacy-preserving or decentralized data settings and for domains beyond image synthesis. The scaling experiments train eight 3B-parameter experts across eight GPU nodes in under a week and show performance gains that do not saturate as model capacity grows.
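To make the router-weighted sum in the TL;DR concrete, here is a minimal NumPy sketch of the inference-time ensemble. It is an illustration under stated assumptions, not the authors' implementation: the name `ensemble_velocity`, the callable interfaces for the experts and the router, and the use of a softmax over router logits are all hypothetical choices made for this example.

```python
import numpy as np


def softmax(logits, axis=-1):
    """Numerically stable softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def ensemble_velocity(x_t, t, experts, router_logits_fn):
    """Combine K expert predictions with router weights.

    Assumed (hypothetical) interfaces:
      experts          -- list of K callables, each mapping (x_t, t) to an
                          array with the same shape as x_t
      router_logits_fn -- callable mapping (x_t, t) to logits of shape (batch, K)
    """
    # Router weights: one probability per expert, per sample.
    weights = softmax(router_logits_fn(x_t, t), axis=-1)      # (batch, K)
    # Evaluate every expert on the same noisy input.
    preds = np.stack([f(x_t, t) for f in experts], axis=1)    # (batch, K, dim)
    # Weighted sum over the expert axis.
    return (weights[..., None] * preds).sum(axis=1)           # (batch, dim)


if __name__ == "__main__":
    x = np.ones((3, 4))
    toy_experts = [lambda x, t: x, lambda x, t: -x]            # two toy "experts"
    uniform_router = lambda x, t: np.zeros((x.shape[0], 2))    # equal weights
    print(ensemble_velocity(x, 0.5, toy_experts, uniform_router))
```

This dense version evaluates every expert for every sample. A sparse variant would evaluate only the experts the router weights most heavily, which trades some fidelity for lower inference cost.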
Abstract
Large-scale AI model training divides work across thousands of GPUs, then synchronizes gradients across them at each step. This incurs a significant network burden that only centralized, monolithic clusters can support, driving up infrastructure costs and straining power systems. We propose Decentralized Diffusion Models, a scalable framework for distributing diffusion model training across independent clusters or datacenters by eliminating the dependence on a centralized, high-bandwidth networking fabric. Our method trains a set of expert diffusion models over partitions of the dataset, each in full isolation from one another. At inference time, the experts ensemble through a lightweight router. We show that the ensemble collectively optimizes the same objective as a single model trained over the whole dataset. This means we can divide the training burden among a number of "compute islands," lowering infrastructure costs and improving resilience to localized GPU failures. Decentralized diffusion models empower researchers to take advantage of smaller, more cost-effective and more readily available compute like on-demand GPU nodes rather than central integrated systems. We conduct extensive experiments on ImageNet and LAION Aesthetics, showing that decentralized diffusion models FLOP-for-FLOP outperform standard diffusion models. We finally scale our approach to 24 billion parameters, demonstrating that high-quality diffusion models can now be trained with just eight individual GPU nodes in less than a week.
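The claim that the ensemble optimizes the same objective as a single model can be motivated with a short calculation. The following is a sketch in standard flow matching notation, which is assumed here rather than taken from the abstract: $u_t(x_t \mid x_0)$ is the conditional velocity for data point $x_0$, $k$ indexes the data partition, and $p_t(k \mid x_t)$ is the posterior probability that the noisy sample $x_t$ originated from partition $k$. By the law of total expectation,

$$
u_t(x_t) \;=\; \mathbb{E}\big[u_t(x_t \mid x_0) \,\big|\, x_t\big]
\;=\; \sum_{k=1}^{K} p_t(k \mid x_t)\,\mathbb{E}\big[u_t(x_t \mid x_0) \,\big|\, x_t, k\big]
\;=\; \sum_{k=1}^{K} p_t(k \mid x_t)\,u_t^{k}(x_t),
$$

where $u_t^{k}(x_t)$ is the marginal velocity of a flow trained only on partition $k$. Under this reading, each expert estimates $u_t^{k}$ from its own partition, and the router estimates the posterior $p_t(k \mid x_t)$, so neither needs gradients from the other.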
