Provably Near-Optimal Federated Ensemble Distillation with Negligible Overhead
Won-Jun Jang, Hyeon-Seo Park, Si-Hyeon Lee
TL;DR
The paper addresses non-IID client data in federated learning by improving server-side ensemble distillation with pseudo-labels. It introduces FedGO, a theoretically grounded weighting method that uses client discriminators, trained with a server-distributed generator, to form an optimal model ensemble, which yields near-optimal server performance with negligible communication, privacy, and computation overhead. The authors derive formal bounds that relate the server model's loss to the ensemble distillation loss and to distributional differences, and they evaluate FedGO on CIFAR-10/100 and ImageNet100, both with a pre-existing server dataset and in data-free variants. FedGO converges faster and reaches higher accuracy than strong baselines. The work also provides open-source code and a systematic analysis of overhead, privacy, and robustness, including data-free and Byzantine-resilience scenarios.
Abstract
Federated ensemble distillation addresses client heterogeneity by generating pseudo-labels for an unlabeled server dataset from client predictions and then training the server model on the pseudo-labeled dataset. The unlabeled server dataset can either be pre-existing or generated through a data-free approach. The effectiveness of this approach depends critically on how weights are assigned to client predictions when creating pseudo-labels, especially in highly heterogeneous settings. Inspired by theoretical results from GANs, we propose a provably near-optimal weighting method that leverages client discriminators trained with a server-distributed generator and local datasets. Our experiments on various image classification tasks demonstrate that the proposed method significantly outperforms baselines. Furthermore, we show that the additional communication cost, client-side privacy leakage, and client-side computational overhead introduced by our method are negligible, in scenarios both with and without a pre-existing server dataset.
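To make the weighting idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes the standard GAN result that an optimal discriminator outputs D(x) = p_data(x) / (p_data(x) + p_g(x)), so the odds D(x) / (1 - D(x)) estimate the ratio of a client's data density to the generator's density at x. Normalizing these odds across clients gives per-sample weights that favor clients whose local data resembles x. The function name, array shapes, and the omission of any client dataset-size factor are illustrative assumptions; the paper's exact weighting formula may differ.

```python
import numpy as np

def odds_weighted_pseudo_labels(client_probs, disc_outputs, eps=1e-6):
    """Combine client predictions into pseudo-labels (illustrative sketch).

    client_probs: shape (K, N, C), class probabilities from K client models
                  on N unlabeled server samples (hypothetical layout).
    disc_outputs: shape (K, N), each client discriminator's output in (0, 1)
                  on the same samples (hypothetical layout).
    """
    client_probs = np.asarray(client_probs, dtype=float)
    # Clip to avoid division by zero when a discriminator saturates.
    d = np.clip(np.asarray(disc_outputs, dtype=float), eps, 1.0 - eps)
    # Odds D / (1 - D): an estimate of p_client(x) / p_generator(x).
    odds = d / (1.0 - d)
    # Normalize over clients so the weights for each sample sum to 1.
    weights = odds / odds.sum(axis=0, keepdims=True)
    # Weighted average of client predictions, per sample.
    pseudo_labels = (weights[:, :, None] * client_probs).sum(axis=0)
    return pseudo_labels, weights
```

For example, with two clients and one sample, discriminator outputs of 0.5 and 0.75 give odds of 1 and 3, so the second client's prediction receives three times the weight of the first.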
