Decentralized Diffusion Models

David McAllister; Matthew Tancik; Jiaming Song; Angjoo Kanazawa

TL;DR

The paper addresses the high infrastructure and interconnect costs of centralized diffusion model training by introducing Decentralized Diffusion Models (DDMs), which partition the data into $K$ clusters and train an independent expert denoiser on each cluster. A separately trained router ensembles the expert outputs at inference, forming $u_t(x_t)=\sum_{k=1}^K r_\theta(k \mid x_t,t)\,v_{\theta_k}(x_t,t)$, where $r_\theta(k \mid x_t,t)$ is the router's weight for expert $k$ and $v_{\theta_k}$ is that expert's flow prediction. This matches the global diffusion objective while avoiding cross-cluster gradient synchronization. The core contributions are the Decentralized Flow Matching (DFM) objective, independent router training, and distillation of the sparse ensemble into a dense model. Experiments on ImageNet and LAION-Aesthetics show FLOP-for-FLOP improvements over standard diffusion models, with scaling to eight experts and 24B parameters. The approach enables training on readily available hardware, improves resilience to localized GPU failures, and opens avenues for privacy-preserving or decentralized data settings and for domains beyond image synthesis. In the scaling experiments, eight 3B-parameter experts are trained across eight GPU nodes in under a week, with evidence that performance gains do not saturate as model capacity grows.

Abstract

Large-scale AI model training divides work across thousands of GPUs, then synchronizes gradients across them at each step. This incurs a significant network burden that only centralized, monolithic clusters can support, driving up infrastructure costs and straining power systems. We propose Decentralized Diffusion Models, a scalable framework for distributing diffusion model training across independent clusters or datacenters by eliminating the dependence on a centralized, high-bandwidth networking fabric. Our method trains a set of expert diffusion models over partitions of the dataset, each in full isolation from one another. At inference time, the experts ensemble through a lightweight router. We show that the ensemble collectively optimizes the same objective as a single model trained over the whole dataset. This means we can divide the training burden among a number of "compute islands," lowering infrastructure costs and improving resilience to localized GPU failures. Decentralized diffusion models empower researchers to take advantage of smaller, more cost-effective and more readily available compute like on-demand GPU nodes rather than central integrated systems. We conduct extensive experiments on ImageNet and LAION Aesthetics, showing that decentralized diffusion models FLOP-for-FLOP outperform standard diffusion models. We finally scale our approach to 24 billion parameters, demonstrating that high-quality diffusion models can now be trained with just eight individual GPU nodes in less than a week.
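
The claim that the ensemble optimizes the same objective as a single model can be motivated with a short derivation. This is a sketch under assumed notation that does not appear in this summary: $q$ is the data distribution, split into $K$ disjoint clusters with prior $p(k)$ and per-cluster distributions $q_k$; $p_t(x_t \mid x_0)$ is the conditional probability path; $u_t(x_t \mid x_0)$ is the conditional flow; and $p_{t,k}(x_t) = \int p_t(x_t \mid x_0)\,q_k(x_0)\,dx_0$ is the noised density of cluster $k$.

```latex
\begin{aligned}
u_t(x_t)
  &= \int u_t(x_t \mid x_0)\,\frac{p_t(x_t \mid x_0)\,q(x_0)}{p_t(x_t)}\,dx_0,
  \qquad q(x_0) = \sum_{k=1}^{K} p(k)\,q_k(x_0) \\
  &= \sum_{k=1}^{K} \frac{p(k)\,p_{t,k}(x_t)}{p_t(x_t)}
     \int u_t(x_t \mid x_0)\,\frac{p_t(x_t \mid x_0)\,q_k(x_0)}{p_{t,k}(x_t)}\,dx_0 \\
  &= \sum_{k=1}^{K} p(k \mid x_t)\,u_{t,k}(x_t)
\end{aligned}
```

Under these assumptions, the marginal flow over the full dataset is a posterior-weighted sum of per-cluster marginal flows $u_{t,k}$. Each expert can therefore regress its own $u_{t,k}$ in isolation, while the router only needs to estimate the cluster posterior $p(k \mid x_t)$.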

Paper Structure (26 sections, 16 equations, 17 figures, 2 tables, 1 algorithm)

Introduction
Related Works
Accelerating Diffusion Models
Mixture of Experts
Low Communication Learning
Decentralized Flow Matching
Preliminary: Flow Matching Objective
Decentralized Flow Matching Objective
Router Training
Expert Training
Inference Strategy
Distillation
Experiments
Implementation Details
Ensembling at Test-Time
...and 11 more sections

Figures (17)

Figure 1: Decentralized Diffusion Models (DDM). Left: Existing diffusion models (monolithic) require synchronized, centralized training across thousands of GPUs, making high-quality training systems expensive and inaccessible. Right: DDM divides a diffusion model into an ensemble of expert models, each trained on its own data cluster in complete isolation. This ensemble collectively optimizes the same diffusion objective as a single model trained on all the data. This enables flexible training across diverse cloud or academic compute facilities. At inference, the ensemble delivers improved performance at the same FLOP-cost, making high-quality diffusion model training more efficient and accessible. See Figure 2 for large-scale DDM model samples.
Figure 2: Decentralized diffusion models train on readily available hardware and generate high-quality, diverse images. We present selected samples from our 8x3B-parameter model.
Figure 3: Decentralized Diffusion Model (DDM) Training Overview. DDMs follow a three-step training process. We first cluster the dataset using off-the-shelf representation extraction models. We then train a diffusion model over each of these clusters, along with a router that associates any input $x_t$ with its most likely clusters. At test time, given a noisy sample, each expert (in red and green) predicts its own flow, and these flows combine linearly via the weights predicted by the router. The combined flow samples the entire distribution and is illustrated on the right.
Figure 4: Ablations at the DiT XL model scale. Eight-expert DDMs perform best most consistently on ImageNet (a) and LAION Aesthetics (b). We show the importance of image-based clustering on ImageNet compared to random clustering (c). Finally, FLOP-for-FLOP, decentralized diffusion models outperform monolithic diffusion models on both datasets (d, e).
Figure 5: DDMs optimize the global diffusion objective. We average samples from the monolithic and DDM ImageNet models using a deterministic sampler with matching random seeds (left) and compare them to outputs generated with random noise samples (right). The left samples are highly correlated, appearing less blurry.
...and 12 more figures
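
The test-time combination described in the Figure 3 caption (each expert predicts a flow, and the router's weights combine them linearly) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `ensemble_flow`, `softmax`, and the toy experts and routers are hypothetical, samples are plain Python lists, and the optional `top_k` truncation with renormalization is an assumed way to keep the ensemble sparse.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_flow(x_t, t, experts, router, top_k=None):
    """Router-weighted linear combination of expert flow predictions.

    x_t:     one noisy sample, as a list of floats
    experts: list of callables (x_t, t) -> list of floats (same length as x_t)
    router:  callable (x_t, t) -> list of K logits, one per expert
    top_k:   ASSUMPTION: optionally keep only the k largest router weights
             and renormalize them, so the remaining experts are never run
    """
    weights = softmax(router(x_t, t))
    if top_k is not None:
        order = sorted(range(len(weights)), key=lambda k: weights[k], reverse=True)
        keep = set(order[:top_k])
        kept_mass = sum(weights[k] for k in keep)
        weights = [w / kept_mass if k in keep else 0.0
                   for k, w in enumerate(weights)]

    combined = [0.0] * len(x_t)
    for w, expert in zip(weights, experts):
        if w == 0.0:
            continue  # experts with zero weight are skipped entirely
        flow = expert(x_t, t)
        for i, f in enumerate(flow):
            combined[i] += w * f
    return combined


# Toy stand-ins (hypothetical): two constant-flow experts and two routers.
def expert_a(x_t, t):
    return [1.0] * len(x_t)


def expert_b(x_t, t):
    return [-1.0] * len(x_t)


def uniform_router(x_t, t):
    return [0.0, 0.0]  # equal logits, so equal weights


def biased_router(x_t, t):
    return [2.0, 0.0]  # prefers expert_a


x = [0.0, 0.0, 0.0]
u_uniform = ensemble_flow(x, 0.5, [expert_a, expert_b], uniform_router)
u_top1 = ensemble_flow(x, 0.5, [expert_a, expert_b], biased_router, top_k=1)
```

With equal router weights the two opposing toy flows cancel, while `top_k=1` with the biased router runs only `expert_a` and returns its flow unchanged. In a real sampler this combined flow would be evaluated at every integration step.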
