Provably Near-Optimal Federated Ensemble Distillation with Negligible Overhead

Won-Jun Jang, Hyeon-Seo Park, Si-Hyeon Lee

TL;DR

The paper tackles the challenge of non-IID client data in federated learning by enhancing ensemble distillation on a server using pseudo-labels. It introduces FedGO, a theoretically grounded weighting method that leverages client discriminators trained with a server-distributed generator to form an optimal model ensemble, enabling near-optimal server performance with negligible communication, privacy, and computation overhead. The authors establish formal bounds linking server loss to ensemble distillation loss and distributional differences, and they validate FedGO across CIFAR-10/100 and ImageNet100 with both server datasets and data-free variants. The approach yields faster convergence and higher accuracy than strong baselines, highlighting practical impact for scalable, heterogeneous FL deployments and robust knowledge transfer from diverse clients. The work also provides open-source code and a systematic analysis of overheads, privacy, and robustness, including data-free and Byzantine-resilience scenarios.

Abstract

Federated ensemble distillation addresses client heterogeneity by generating pseudo-labels for an unlabeled server dataset based on client predictions and training the server model using the pseudo-labeled dataset. The unlabeled server dataset can either be pre-existing or generated through a data-free approach. The effectiveness of this approach critically depends on the method of assigning weights to client predictions when creating pseudo-labels, especially in highly heterogeneous settings. Inspired by theoretical results from GANs, we propose a provably near-optimal weighting method that leverages client discriminators trained with a server-distributed generator and local datasets. Our experiments on various image classification tasks demonstrate that the proposed method significantly outperforms baselines. Furthermore, we show that the additional communication cost, client-side privacy leakage, and client-side computational overhead introduced by our method are negligible, both in scenarios with and without a pre-existing server dataset.
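The weighting idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that each client discriminator outputs a value in (0, 1) for a server sample, that the odds `d / (1 - d)` of that output approximate how well the client's data distribution covers the sample (as the standard GAN optimal-discriminator result suggests), and that normalizing these odds across clients gives per-sample ensemble weights. The function names `discriminator_odds`, `client_weights`, and `weighted_pseudo_label` are hypothetical.

```python
from typing import List, Sequence


def discriminator_odds(d: float, eps: float = 1e-12) -> float:
    """Return the odds d / (1 - d) of a discriminator output d in (0, 1)."""
    # Clamp to avoid division by zero when d is exactly 0 or 1.
    d = min(max(d, eps), 1.0 - eps)
    return d / (1.0 - d)


def client_weights(disc_outputs: Sequence[float]) -> List[float]:
    """Normalize per-client odds into weights that sum to 1 (sketch assumption)."""
    odds = [discriminator_odds(d) for d in disc_outputs]
    total = sum(odds)
    return [o / total for o in odds]


def weighted_pseudo_label(
    client_probs: Sequence[Sequence[float]],
    disc_outputs: Sequence[float],
) -> List[float]:
    """Combine client class-probability vectors for one server sample."""
    weights = client_weights(disc_outputs)
    num_classes = len(client_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, client_probs))
        for c in range(num_classes)
    ]


# Hypothetical example: two clients, two classes, one unlabeled server sample.
probs = [[0.9, 0.1], [0.2, 0.8]]   # softmax outputs of the two client models
disc = [0.8, 0.5]                  # client discriminator outputs for the sample
weights = client_weights(disc)     # roughly [0.8, 0.2]
label = weighted_pseudo_label(probs, disc)  # roughly [0.76, 0.24]
```

In this toy case the first client's discriminator is more confident that the sample resembles its local data, so its prediction dominates the pseudo-label. Uniform weighting would instead give both clients a weight of 0.5.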

Paper Structure

This paper contains 51 sections, 11 theorems, 27 equations, 10 figures, 17 tables, 2 algorithms.

Key Result

Theorem 2.1

(GAN) For a fixed generator $G$, let $p_g$ and $p_{\text{data}}$ denote the density functions of the distribution generated by $G$ and of the real data distribution, respectively. Then the output of an optimal discriminator $D^*$ for input data $x$ is given as follows:

$$D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}$$
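A small numerical sketch can make this result concrete. It assumes the standard closed form from the GAN literature (the data density divided by the sum of the data and generator densities) and checks it against a brute-force maximization of the pointwise GAN objective $p_{\text{data}}(x)\log d + p_g(x)\log(1-d)$. The two Gaussian densities and all function names are illustrative choices, not taken from the paper.

```python
import math


def gaussian_pdf(x: float, mean: float, std: float) -> float:
    """Density of a 1-D Gaussian, used here as a stand-in distribution."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def optimal_discriminator(p_data: float, p_g: float) -> float:
    """Closed-form optimal discriminator output for fixed densities."""
    return p_data / (p_data + p_g)


def pointwise_objective(d: float, p_data: float, p_g: float) -> float:
    """Integrand of the GAN value function at a single input x."""
    return p_data * math.log(d) + p_g * math.log(1.0 - d)


def grid_argmax(p_data: float, p_g: float, steps: int = 9999) -> float:
    """Brute-force maximizer of the pointwise objective over a grid in (0, 1)."""
    grid = [(i + 1) / (steps + 1) for i in range(steps)]
    return max(grid, key=lambda d: pointwise_objective(d, p_data, p_g))


# Assumed toy setup: real data ~ N(0, 1), generator ~ N(1, 1), evaluated at x = 0.
x = 0.0
p_data = gaussian_pdf(x, mean=0.0, std=1.0)
p_g = gaussian_pdf(x, mean=1.0, std=1.0)

d_star = optimal_discriminator(p_data, p_g)
d_grid = grid_argmax(p_data, p_g)

# The odds of the optimal output recover the density ratio p_data / p_g.
odds = d_star / (1.0 - d_star)
ratio = p_data / p_g
```

The grid maximizer agrees with the closed form up to the grid resolution, and the odds of the optimal output equal the density ratio. That second property is what lets a discriminator's output serve as a proxy for how well a distribution covers a given sample.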

Figures (10)

  • Figure 1: A toy example of decision boundaries of aggregated models. Each point represents a data sample, and its color represents its label. The background color shows the decision regions of each model in the RGB channels. The oracle decision boundary, shown by the black lines, corresponds to the $x$-axis and $y$-axis. For aggregated models, we consider the parameter-averaged model (FedAVG) and ensemble-distilled models using uniform weighting (FedDF), variance weighting (Fed-ET), entropy weighting (FedHKT), domain-aware weighting (Da), and ours. Detailed settings are provided in the appendix.
  • Figure 2: Ensemble test accuracy (%) of FedGO and other baseline weighting methods over communication rounds on CIFAR-10 with $\alpha=0.1$ and $\alpha=0.05$.
  • Figure 3: Illustration of our FedGO algorithm.
  • Figure 4: The top row shows the server's and clients' datasets. The bottom row shows the decision boundaries of the aggregated models; it is the same as Figure 1 and is copied here for ease of analysis.
  • Figure 5: Client data split for CIFAR-10 with $\alpha=0.1, 0.05$.
  • ...and 5 more figures

Theorems & Definitions (20)

  • Theorem 2.1
  • Definition 3.1
  • Theorem 3.2
  • Corollary 3.3
  • Theorem 3.4
  • Definition 3.5
  • Theorem 3.6
  • Definition 1.1
  • Definition 1.2
  • Lemma 1.3
  • ...and 10 more