Personalized Federated Training of Diffusion Models with Privacy Guarantees
Kumar Kshitij Patel, Weitong Zhang, Lingxiao Wang
TL;DR
This paper tackles data scarcity and privacy constraints in sensitive domains by proposing PFDM, a personalized federated diffusion framework that trains diffusion models without sharing raw data. PFDM decomposes the reverse diffusion process into client-specific and global denoisers: each client keeps control of fine-grained generation, while the global denoiser learns only from noisy, diffused data to preserve privacy. A formal local differential privacy guarantee is established for the global denoiser, with the privacy-utility trade-off controlled by the choice of diffusion steps, and per-pixel DP variants are discussed. Empirical results on CIFAR-10 and MNIST show performance competitive with centralized training and clear improvements over non-collaborative baselines, especially under high data heterogeneity and when generating bias-prone minority classes. These results highlight the method’s practical value for private, collaborative data synthesis.
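To make the split between the two denoisers concrete, here is a minimal sketch of a reverse process that routes high-noise steps to a shared global denoiser and low-noise steps to a client's personal denoiser. It is an illustration under stated assumptions, not the paper's implementation: the cutoff index `t0`, the linear beta schedule, the standard DDPM update, and the toy denoisers `toy_global` and `toy_personal` are all hypothetical choices made here.

```python
import math
import random


def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (an assumption; any valid schedule would do)."""
    return [beta_start + (beta_end - beta_start) * i / max(T - 1, 1) for i in range(T)]


def ddpm_step(x_t, eps_hat, t, betas, alpha_bar, rng):
    """One standard DDPM ancestral sampling step from index t to t - 1."""
    coef = betas[t] / math.sqrt(1.0 - alpha_bar[t])
    scale = 1.0 / math.sqrt(1.0 - betas[t])
    mean = [scale * (x - coef * e) for x, e in zip(x_t, eps_hat)]
    if t == 0:
        return mean  # no noise is added on the final step
    sigma = math.sqrt(betas[t])
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mean]


def sample_split(dim, T, t0, global_eps, personal_eps, rng):
    """Run the reverse process, using global_eps for t >= t0 and personal_eps below."""
    betas = linear_betas(T)
    alpha_bar, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        alpha_bar.append(prod)

    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from pure noise
    route = []
    for t in range(T - 1, -1, -1):
        use_global = t >= t0  # hypothetical cutoff between the two denoisers
        denoiser = global_eps if use_global else personal_eps
        route.append("global" if use_global else "personal")
        x = ddpm_step(x, denoiser(x, t), t, betas, alpha_bar, rng)
    return x, route


# Toy stand-ins for trained noise-prediction networks (hypothetical).
def toy_global(x, t):
    return [0.0 for _ in x]


def toy_personal(x, t):
    return [0.1 * v for v in x]


sample, route = sample_split(
    dim=4, T=10, t0=3,
    global_eps=toy_global, personal_eps=toy_personal,
    rng=random.Random(0),
)
print(route)
```

With `T=10` and `t0=3`, the first seven reverse steps (indices 9 down to 3) call the global denoiser and the last three (indices 2 down to 0) call the personal one. In this sketch, moving `t0` changes how much of the generation is shared and how much stays local.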
Abstract
The scarcity of accessible, compliant, and ethically sourced data presents a considerable challenge to the adoption of artificial intelligence (AI) in sensitive fields like healthcare, finance, and biomedical research. Furthermore, access to unrestricted public datasets is increasingly constrained due to rising concerns over privacy, copyright, and competition. Synthetic data has emerged as a promising alternative, and diffusion models -- a cutting-edge generative AI technology -- provide an effective solution for generating high-quality and diverse synthetic data. In this paper, we introduce a novel federated learning framework for training diffusion models on decentralized private datasets. Our framework leverages personalization and the inherent noise in the forward diffusion process to produce high-quality samples while ensuring robust differential privacy guarantees. Our experiments show that our framework outperforms non-collaborative training methods, particularly in settings with high data heterogeneity, and effectively reduces biases and imbalances in synthetic data, resulting in fairer downstream models.
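The role of the forward-process noise can be illustrated with a short sketch. It draws a diffused sample x_t = sqrt(ᾱ_t) x_0 + sqrt(1 - ᾱ_t) ε and reports the effective noise level sqrt((1 - ᾱ_t)/ᾱ_t) obtained after rescaling by 1/sqrt(ᾱ_t), together with the ratio of the data's L2 diameter to that noise level, which is the kind of quantity Gaussian-mechanism analyses depend on. The linear schedule, the [-1, 1] data range, and all function names are assumptions made here for illustration; the sketch does not reproduce the paper's privacy theorem or its constants.

```python
import math
import random


def alpha_bar_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bar, prod = [], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / max(T - 1, 1)
        prod *= 1.0 - beta
        alpha_bar.append(prod)
    return alpha_bar


def diffuse(x0, t, alpha_bar, rng):
    """Sample x_t = sqrt(ab_t) * x0 + sqrt(1 - ab_t) * eps, with eps ~ N(0, I)."""
    a = alpha_bar[t]
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in x0]


def effective_noise_std(t, alpha_bar):
    """Noise std after rescaling x_t by 1/sqrt(ab_t), i.e. x0 + sigma * eps."""
    a = alpha_bar[t]
    return math.sqrt((1.0 - a) / a)


def sensitivity_to_noise(t, alpha_bar, dim, lo=-1.0, hi=1.0):
    """L2 diameter of the box [lo, hi]^dim divided by the effective noise std."""
    return (hi - lo) * math.sqrt(dim) / effective_noise_std(t, alpha_bar)


alpha_bar = alpha_bar_schedule(1000)
rng = random.Random(0)
x0 = [rng.uniform(-1.0, 1.0) for _ in range(8)]  # toy "image" with 8 pixels

for t in (99, 499, 999):
    x_t = diffuse(x0, t, alpha_bar, rng)
    print(
        t,
        round(effective_noise_std(t, alpha_bar), 3),
        round(sensitivity_to_noise(t, alpha_bar, dim=len(x0)), 3),
    )
```

Because ᾱ_t decreases with t, the effective noise grows and the sensitivity-to-noise ratio shrinks at later steps. This is consistent with the idea that sharing more heavily diffused data gives stronger privacy at some cost in utility.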
