Disentangling data distribution for Federated Learning

Xinyuan Zhao, Hanlin Gu, Lixin Fan, Yuxing Han, Qiang Yang

TL;DR

This paper tackles FL inefficiency caused by entangled client data by introducing FedDistr, a diffusion-based method that disentangles client distributions into base components and aligns them on the server for one-round communication. It provides theoretical guarantees under fully disentangled and near-disentangled regimes and demonstrates empirical gains on CIFAR100 and DomainNet, achieving favorable utility and communication efficiency while preserving privacy. The approach combines latent diffusion-based distribution disentangling, assignment-based distribution alignment, and synthetic data generation for downstream training, offering a practical pathway toward highly efficient, privacy-conscious FL. Overall, FedDistr shows how distribution-level disentangling can close the gap between FL and ideal distributed systems in real-world, heterogeneous data environments.

Abstract

Federated Learning (FL) facilitates collaborative training of a global model whose performance is boosted by private data owned by distributed clients, without compromising data privacy. Yet the wide applicability of FL is hindered by entanglement of data distributions across different clients. This paper demonstrates for the first time that by disentangling data distributions FL can in principle achieve efficiencies comparable to those of distributed systems, requiring only one round of communication. To this end, we propose a novel FedDistr algorithm, which employs stable diffusion models to decouple and recover data distributions. Empirical results on the CIFAR100 and DomainNet datasets show that FedDistr significantly enhances model utility and efficiency in both disentangled and near-disentangled scenarios while ensuring privacy, outperforming traditional federated learning methods.
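The assignment-based alignment step mentioned in the TL;DR can be illustrated with a toy sketch. This is not the FedDistr implementation: it assumes each client uploads a short list of component summaries (plain mean vectors here, standing in for whatever the diffusion model actually learns), that every client reports the same number of components, and that the server matches components by a brute-force minimum-cost assignment before averaging. The names `align_components` and `aggregate_one_round` are hypothetical.

```python
from itertools import permutations
from math import dist


def align_components(reference, uploaded):
    """Find the ordering of `uploaded` that best matches `reference`.

    Returns a list p such that uploaded[p[i]] is matched to reference[i],
    minimising the total Euclidean distance. Brute force over all
    permutations, so only suitable for a handful of components.
    """
    n = len(reference)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        cost = sum(dist(reference[i], uploaded[perm[i]]) for i in range(n))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return list(best_perm)


def aggregate_one_round(client_uploads):
    """Align every client's components to the first client, then average.

    Assumption: one upload per client, i.e. a single communication round.
    """
    reference = client_uploads[0]
    n, d = len(reference), len(reference[0])
    sums = [[0.0] * d for _ in range(n)]
    for upload in client_uploads:
        perm = align_components(reference, upload)
        for i in range(n):
            for j in range(d):
                sums[i][j] += upload[perm[i]][j]
    k = len(client_uploads)
    return [[s / k for s in row] for row in sums]


# Two clients describe the same two base components, listed in different order.
client_a = [[0.0, 0.0], [10.0, 10.0]]
client_b = [[10.2, 9.8], [0.2, -0.2]]
merged = aggregate_one_round([client_a, client_b])
```

Without the alignment step, averaging the uploads position by position would blend the two unrelated components together; matching first keeps each base component separate. For more than a few components, a polynomial-time assignment solver would replace the permutation search.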

Paper Structure

This paper contains 25 sections, 7 theorems, 28 equations, 9 figures, 2 tables, 1 algorithm.

Key Result

Theorem 1

If $f$ is $L$-Lipschitz, then a disentangled data distribution across clients is a sufficient condition for the existence of a privacy-preserving federated algorithm that requires only a single communication round and achieves a utility loss of less than $\epsilon$ with probability at least $1

Figures (9)

  • Figure 1: Disentangled and near-disentangled cases. In the disentangled case, the two clients hold data drawn from two disentangled base distributions, $P_1$ and $P_2$, respectively. In the $\xi$-entangled case, each client holds data from both base distributions $P_1$ and $P_2$, with one base distribution dominating the other. In both cases, client $k$ ($k=1,2$) communicates with the server in a single round, uploading the distribution $[[S_k]]$ obtained by applying the privacy-preserving mechanism to $S_k$. The server uses a different aggregation strategy for each scenario: $\text{Aggre}_A$ for disentangled and $\text{Aggre}_B$ for near-disentangled data distributions.
  • Figure 2: Two cases of $\xi$-entangled: $\xi>0$ (left) and $\xi=0$ (right).
  • Figure 3: Overview of the proposed algorithm FedDistr.
  • Figure 4: Tradeoff between utility loss and communication rounds for different methods under different $\xi$-entangled scenarios on CIFAR100 and DomainNet.
  • Figure 5: Tradeoff between utility loss and privacy leakage for different methods under different $\xi$-entangled scenarios on CIFAR100.
  • ...and 4 more figures

Theorems & Definitions (13)

  • Definition 1
  • Definition 2
  • Theorem 1
  • Theorem 2
  • Lemma 1
  • Lemma 2
  • proof
  • Theorem 1
  • proof
  • Lemma 3
  • ...and 3 more