Heterogeneous Decentralized Diffusion Models

Zhiying Jiang, Raihan Seraj, Marcos Villagra, Bidhan Roy

TL;DR

By eliminating synchronization requirements and enabling mixed DDPM/FM objectives, this framework lowers infrastructure requirements for decentralized generative model training.

Abstract

Training frontier-scale diffusion models often requires substantial computational resources concentrated in tightly coupled clusters, limiting participation to well-resourced institutions. While Decentralized Diffusion Models (DDM) enable training multiple experts in isolation, existing approaches require 1176 GPU-days and homogeneous training objectives across all experts. We present an efficient framework that reduces resource requirements while supporting heterogeneous training objectives. Our approach combines three contributions: (1) a heterogeneous decentralized training paradigm that allows experts to use different objectives (DDPM and Flow Matching), unified at inference time via a deterministic schedule-aware conversion into a common velocity space without retraining; (2) pretrained checkpoint conversion from ImageNet-DDPM to Flow Matching objectives, accelerating convergence and enabling initialization without objective-specific pretraining; and (3) PixArt-alpha's efficient AdaLN-Single architecture, reducing parameters while maintaining quality. Experiments on LAION-Aesthetics show that, relative to the training scale reported for prior DDM work, our approach reduces compute from 1176 to 72 GPU-days (16x) and data from 158M to 11M (14x). Under aligned inference settings, our heterogeneous 2DDPM:6FM configuration achieves better FID (11.88 vs. 12.45) and higher intra-prompt diversity (LPIPS 0.631 vs. 0.617) than the homogeneous 8FM baseline. By eliminating synchronization requirements and enabling mixed DDPM/FM objectives, our framework lowers infrastructure requirements for decentralized generative model training.

Paper Structure (46 sections, 28 equations, 17 figures, 4 tables)

Figures (17)

  • Figure 1: Text-to-Image Generation with Heterogeneous Decentralized Diffusion. Our framework combines multiple expert models trained with different objectives (DDPM and Flow Matching) in complete isolation to generate high-quality, diverse images from text prompts. All samples are generated at 256$\times$256 resolution using 8 heterogeneous experts trained on LAION-Aesthetics with only 72 A100-days of compute.
  • Figure 2: Inference Pipeline for Heterogeneous Expert Fusion. Given noisy input $(x_t, t, c)$, the router predicts cluster probabilities $p_\phi(k|x_t, t)$ to weight expert contributions. DDPM experts output epsilon predictions while Flow Matching experts output velocity predictions. Schedule-aware conversion functions deterministically unify all predictions into a common velocity space $v^{(k)}$ without retraining, enabling router-weighted fusion $u_t(x_t) = \sum_{k=1}^{K} p_t(k|x_t) \cdot v^{(k)}$ for ODE-based sampling.
  • Figure 4: Impact of Router Threshold on Generation Quality. Different thresholds affect quality-diversity trade-offs.
  • Figure 5: Qualitative comparison: Homogeneous vs. Heterogeneous models. Images generated from identical prompts and random seeds. Homogeneous models (left, trained with Flow Matching only) often appear smoother in texture. Heterogeneous models (right, combining FM and DDPM experts) often preserve sharper local details and richer texture variation.
  • Figure 6: Training Pipeline for Decentralized Heterogeneous Experts. LAION dataset $\mathcal{D}$ is partitioned into $K$ semantic clusters $\{S_1, S_2, \ldots, S_K\}$ using DINOv2 feature extraction and hierarchical clustering. Each expert trains independently on its assigned cluster with heterogeneous objectives: DDPM experts predict noise $\epsilon_{\theta_k}(x_t, t)$ while Flow Matching experts predict velocity $v_{\theta_k}(x_t, t)$. The router network $\phi$ trains on all data to predict cluster assignments via cross-entropy loss. Crucially, there is zero gradient synchronization, parameter sharing, or activation passing between experts during training.
  • ...and 12 more figures
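
The fusion step described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the schedule `vp_schedule`, the helper names `eps_to_velocity` and `fuse`, and the choice of a cosine variance-preserving path $x_t = a(t)\,x_0 + b(t)\,\epsilon$ are all assumptions made for illustration. Under that assumed path, an epsilon prediction is converted to a velocity by first recovering $\hat{x}_0 = (x_t - b\,\hat{\epsilon})/a$ and then evaluating $v = a'(t)\,\hat{x}_0 + b'(t)\,\hat{\epsilon}$; the router-weighted sum $u_t(x_t) = \sum_{k} p_k \cdot v^{(k)}$ follows the caption.

```python
import numpy as np

def vp_schedule(t):
    """Hypothetical schedule (an assumption, not from the paper):
    x_t = a(t) * x_0 + b(t) * eps with a = cos(pi t / 2), b = sin(pi t / 2)."""
    a, b = np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)
    da, db = -0.5 * np.pi * b, 0.5 * np.pi * a  # time derivatives of a and b
    return a, b, da, db

def eps_to_velocity(x_t, eps_pred, t):
    """Convert an epsilon prediction into a velocity under the assumed path."""
    a, b, da, db = vp_schedule(t)
    x0_pred = (x_t - b * eps_pred) / a  # invert the forward process; needs a(t) != 0
    return da * x0_pred + db * eps_pred

def fuse(router_probs, velocities):
    """Router-weighted fusion: u = sum_k p_k * v_k over K experts."""
    p = np.asarray(router_probs, dtype=float)  # shape (K,)
    v = np.stack(velocities)                   # shape (K, *sample_shape)
    return np.tensordot(p, v, axes=1)          # contracts the expert axis

# Usage: one converted DDPM-style expert and one stand-in velocity expert.
rng = np.random.default_rng(0)
x0, eps, t = rng.normal(size=(4, 4)), rng.normal(size=(4, 4)), 0.3
a, b, da, db = vp_schedule(t)
x_t = a * x0 + b * eps
v_ddpm = eps_to_velocity(x_t, eps, t)  # uses the true eps as a perfect prediction
v_fm = 2.0 * v_ddpm                    # placeholder for a velocity-predicting expert
u = fuse([0.25, 0.75], [v_ddpm, v_fm])
```

The sketch simplifies in one important way: it assumes every expert is evaluated on the same $x_t$ and that the common velocity space uses the same path as the conversion schedule. How the paper aligns the DDPM and Flow Matching time parameterizations is not stated in the captions above and is not reproduced here.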