Synthetic data shuffling accelerates the convergence of federated learning under data heterogeneity

Bo Li; Yasin Esfandiari; Mikkel N. Schmidt; Tommy S. Alstrøm; Sebastian U. Stich

TL;DR

The paper addresses convergence challenges in heterogeneous federated learning by establishing a quantitative link between data shuffling and optimization speed. It shows that shuffling a fraction $p$ of data across clients quadratically reduces gradient dissimilarity, enabling faster convergence, and provides convergence-rate bounds for strongly convex and non-convex objectives. Building on this theory, the authors propose Fedssyn, a practical framework that uses locally trained synthetic data generators to produce shuffled synthetic data, preserving data access rights and offering differential privacy options. Empirically, Fedssyn yields substantial reductions in communication rounds and improvements in accuracy across CIFAR-10/100 and varying participation, with DP variants demonstrating privacy-preserving viability. The approach thus offers a principled, privacy-aware pathway to close the performance gap caused by data heterogeneity in FL.

Abstract

In federated learning, data heterogeneity is a critical challenge. A straightforward solution is to shuffle the clients' data to homogenize the distribution. However, this may violate data access rights, and how and when shuffling can accelerate the convergence of a federated optimization algorithm is not theoretically well understood. In this paper, we establish a precise and quantifiable correspondence between data heterogeneity and parameters in the convergence rate when a fraction of data is shuffled across clients. We prove that shuffling can quadratically reduce the gradient dissimilarity with respect to the shuffling percentage, accelerating convergence. Inspired by the theory, we propose a practical approach that addresses the data access rights issue by shuffling locally generated synthetic data. The experimental results show that shuffling synthetic data improves the performance of multiple existing federated learning algorithms by a large margin.
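The quadratic reduction claimed above can be checked numerically on a toy problem. The sketch below is not the paper's experiment. It assumes a simple mean-estimation objective $f_i(x) = \frac{1}{m}\sum_j \frac{1}{2}\|x - a_{ij}\|^2$ per client, so that $\nabla f_i(x) = x - \mathrm{mean}(\mathcal{D}_i)$, and it measures gradient dissimilarity as the average squared distance between client gradients and their mean. The function names, the client count, and the data distribution are all hypothetical choices made for illustration.

```python
import numpy as np

def gradient_dissimilarity(client_data, x):
    """Average squared distance between client gradients and their mean.

    Assumes f_i(x) = mean_j 0.5 * ||x - a_ij||^2, so grad f_i(x) = x - mean(D_i).
    """
    grads = np.stack([x - d.mean(axis=0) for d in client_data])
    return float(np.mean(np.sum((grads - grads.mean(axis=0)) ** 2, axis=1)))

def shuffle_fraction(client_data, p, rng):
    """Each client sends a fraction p of its samples to a shared pool,
    the pool is permuted, and every client receives an equal share back."""
    n = len(client_data)
    m = client_data[0].shape[0]
    k = int(round(p * m))  # samples each client contributes and receives
    kept, pool = [], []
    for d in client_data:
        perm = rng.permutation(m)
        pool.append(d[perm[:k]])
        kept.append(d[perm[k:]])
    pool = np.concatenate(pool)
    pool = pool[rng.permutation(len(pool))]
    return [np.concatenate([kept[i], pool[i * k:(i + 1) * k]]) for i in range(n)]

rng = np.random.default_rng(0)
n_clients, m, dim = 10, 2000, 5  # hypothetical sizes
# Heterogeneous clients: each one draws around its own well-separated center.
centers = rng.normal(scale=3.0, size=(n_clients, dim))
clients = [c + rng.normal(size=(m, dim)) for c in centers]

x = np.zeros(dim)
zeta2 = gradient_dissimilarity(clients, x)
ratios = {}
for p in (0.0, 0.25, 0.5, 0.75):
    shuffled = shuffle_fraction(clients, p, rng)
    ratios[p] = gradient_dissimilarity(shuffled, x) / zeta2
    print(f"p={p:.2f}  measured ratio={ratios[p]:.3f}  (1-p)^2={(1 - p) ** 2:.3f}")
```

In this toy model the measured ratio tracks $(1-p)^2$ up to sampling noise, which is one concrete instance of a quadratic dependence on the shuffling fraction. The exact constants and additional terms in the paper's lemma are not reproduced here.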

Paper Structure (24 sections, 1 theorem, 30 equations, 16 figures, 4 tables, 2 algorithms)

Introduction
Related work
Influence of the data heterogeneity on the convergence rate
Motivation
Modelling Data Shuffling and Theoretical Analysis
Illustrative experiments on convex functions
Synthetic data shuffling
Experimental setup
Experimental results
Towards differentially private Fedssyn
Conclusion
Appendix
Proof
Extra related work comparison
Extra experimental details
...and 9 more sections

Key Result

Lemma 1

If Assumption assum:original_gradient_dissimilarity -- assum:delta hold, then in expectation over potential randomness in selecting $\tilde{\mathcal{D}}$:

Figures (16)

Figure 1: Our proposed framework. (a) Each client learns a generator with a subset of its local data and generates synthetic data, which are communicated to the server. The server then shuffles and sends the partitioned collection of the synthetic data to each client. (b) With the updated local data, any FL algorithm can be used to learn a server model. (c) When the clients are very heterogeneous, compared to shuffling the real data, shuffling synthetic data achieves a similar accuracy while alleviating information leakage.
Figure 2: Convergence of $\frac{1}{n}\sum_{i=1}^n||\mathbf{x}_i^t-\mathbf{x}^\star||^2$. (a) With a fixed $\zeta^2$ and step size, shuffling reduces the optimal error more when the stochastic noise is low. (b) When gradient dissimilarity $\zeta^2$ dominates the convergence, we obtain a super-linear speedup in the number of rounds to reach $\varepsilon$ by shuffling more data. The vertical bar shows the theoretical number of rounds to reach $\varepsilon$. The step size is tuned in (b).
Figure 3: Top-1 accuracy. We compare the experiments where the local dataset is $\mathcal{D}_i$, $(1-p)\mathcal{D}_i+p\tilde{\mathcal{D}}_i$ (local+local synthetic data), and $(1-p)\mathcal{D}_i+p\tilde{\mathcal{D}}_{si}$ (local+shuffled synthetic data). The black and green dotted lines represent the accuracy using the centralised real and synthetic data, respectively. Using shuffled synthetic data (red bar) boosts the Top-1 accuracy, and in some cases, even matches the centralised accuracy.
Figure 4: Sensitivity analysis using FedAvg and CIFAR10. (a) The influence of the number of images used for training the generator $\rho\cdot n_i$ and the number of synthetic images per client $\tilde{n}$ ($\alpha=0.1$). We annotate the combination that performs better than the centralized baseline. (b) The influence of the number of training epochs for the generator with 10 and 40 clients ($\alpha=0.01$). (c) The influence of using different generators ($\alpha=0.01$). We obtain better performance using DDPM than other generators.
Figure 5: The empirical observation of stochastic noise and gradient dissimilarity matches the theoretical statement (experiment using CIFAR10, 10 clients, $\alpha=0.1$, $p=0.06$).
...and 11 more figures
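
The data flow described in the Figure 1(a) caption (clients send synthetic data, the server shuffles and partitions it, clients extend their local data) can be sketched as follows. This is a minimal illustration, not the authors' implementation. The generator is replaced by a hypothetical stand-in that tags each sample with the id of the client that produced it, and the equal-sized split across clients is an assumption, since the caption only says the collection is partitioned.

```python
import numpy as np

def shuffle_and_partition(synthetic_sets, rng):
    """Pool the per-client synthetic samples, permute them,
    and split the pool into one part per client (equal split is assumed)."""
    pooled = np.concatenate(synthetic_sets)
    pooled = pooled[rng.permutation(len(pooled))]
    return np.array_split(pooled, len(synthetic_sets))

rng = np.random.default_rng(0)
n_clients, n_real, n_synth = 4, 300, 100  # hypothetical sizes

# Stand-in for real local data: every sample carries its owner's client id.
local_real = [np.full((n_real, 1), i) for i in range(n_clients)]
# Stand-in for generator output: client i emits samples tagged with its own id.
synthetic_sets = [np.full((n_synth, 1), i) for i in range(n_clients)]

# Server side: shuffle the pooled synthetic data and partition it.
parts = shuffle_and_partition(synthetic_sets, rng)

# Client side: updated local data = real local data + received synthetic share.
updated = [np.concatenate([real, part]) for real, part in zip(local_real, parts)]

for i, part in enumerate(parts):
    ids, counts = np.unique(part, return_counts=True)
    print(f"client {i} receives synthetic samples from clients {ids.tolist()} "
          f"with counts {counts.tolist()}")
```

Only synthetic samples leave a client in this sketch; the real local arrays are never pooled, which mirrors the data access argument made in the caption.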

Theorems & Definitions (4)

Lemma 1
proof
proof
proof
