FedEMA-Distill: Exponential Moving Average Guided Knowledge Distillation for Robust Federated Learning

Hamza Reguieg; Mohamed El Kamili; Essaid Sabir

TL;DR

Results indicate that coupling temporal smoothing with logits-only aggregation provides a communication-efficient and attack-resilient FL pipeline that is deployment-friendly and compatible with secure aggregation and differential privacy, since only aggregated or obfuscated model outputs are exchanged.

Abstract

Federated learning (FL) often degrades when clients hold heterogeneous non-Independent and Identically Distributed (non-IID) data and when some clients behave adversarially, leading to client drift, slow convergence, and high communication overhead. This paper proposes FedEMA-Distill, a server-side procedure that combines an exponential moving average (EMA) of the global model with ensemble knowledge distillation from client-uploaded prediction logits evaluated on a small public proxy dataset. Clients run standard local training, upload only compressed logits, and may use different model architectures, so no changes are required to client-side software while still supporting model heterogeneity across devices. Experiments on CIFAR-10, CIFAR-100, FEMNIST, and AG News under Dirichlet-0.1 label skew show that FedEMA-Distill improves top-1 accuracy by several percentage points (up to +5% on CIFAR-10 and +6% on CIFAR-100) over representative baselines, reaches a given target accuracy in 30-35% fewer communication rounds, and reduces per-round client uplink payloads to 0.09-0.46 MB, i.e., roughly an order of magnitude less than transmitting full model weights. Using coordinate-wise median or trimmed-mean aggregation of logits at the server further stabilizes training in the presence of up to 10-20% Byzantine clients and yields well-calibrated predictions under attack. These results indicate that coupling temporal smoothing with logits-only aggregation provides a communication-efficient and attack-resilient FL pipeline that is deployment-friendly and compatible with secure aggregation and differential privacy, since only aggregated or obfuscated model outputs are exchanged.
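The server-side steps described in the abstract (robust aggregation of client logits, followed by temporal smoothing) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `aggregate_logits` and `ema_update`, the smoothing factor `beta=0.9`, the `trim` parameter, and the toy logits are all assumptions chosen for illustration. The distillation step itself (training the server model on the aggregated logits) is omitted, and the EMA is shown on a plain list of parameters.

```python
from statistics import median


def aggregate_logits(client_logits, rule="median", trim=1):
    """Coordinate-wise aggregation of per-client logits on the proxy set.

    client_logits[k][i][c] is client k's logit for proxy sample i, class c.
    `rule` and `trim` are illustrative parameters, not taken from the paper.
    """
    num_samples = len(client_logits[0])
    num_classes = len(client_logits[0][0])
    aggregated = []
    for i in range(num_samples):
        row = []
        for c in range(num_classes):
            values = sorted(client[i][c] for client in client_logits)
            if rule == "median":
                row.append(median(values))
            elif rule == "trimmed_mean":
                if 2 * trim >= len(values):
                    raise ValueError("trim removes every client value")
                # Drop the `trim` smallest and `trim` largest values.
                kept = values[trim:len(values) - trim]
                row.append(sum(kept) / len(kept))
            else:
                # Plain mean: shown only as the non-robust reference.
                row.append(sum(values) / len(values))
        aggregated.append(row)
    return aggregated


def ema_update(ema_params, new_params, beta=0.9):
    """Blend the previous EMA parameters with newly distilled parameters."""
    return [beta * e + (1.0 - beta) * n for e, n in zip(ema_params, new_params)]


# Toy example: one proxy sample, two classes, three honest clients that
# favor class 0 and one Byzantine client that sends extreme logits.
honest = [[[2.0, 0.0]], [[2.2, -0.1]], [[1.8, 0.1]]]
byzantine = [[[-100.0, 100.0]]]
clients = honest + byzantine

robust = aggregate_logits(clients, rule="median")
trimmed = aggregate_logits(clients, rule="trimmed_mean", trim=1)
naive = aggregate_logits(clients, rule="mean")

# Smooth a hypothetical two-parameter model toward the distilled weights.
smoothed = ema_update([1.0, 0.0], [0.0, 1.0], beta=0.9)
```

In this toy setting the median and trimmed-mean targets still favor class 0, while the plain mean is pulled toward the attacker's class, which is the qualitative behavior the abstract attributes to robust logit aggregation.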

Paper Structure (21 sections, 1 equation, 8 figures, 7 tables, 1 algorithm)

Introduction
Related Work
Positioning and quantitative comparison.
Methodology
Why EMA is applied after distillation.
Hyperparameters and practical choices.
Stability of EMA-guided distillation.
System Model
Experiments and Results
Setup
Train/test split and use of validation data.
Client-level dataset characteristics
Main Results: Accuracy and Efficiency
Communication Efficiency
Robustness to Byzantine Clients
...and 6 more sections

Figures (8)

Figure 1: Conceptual overview of FedEMA-Distill.
Figure 2: Total upload needed to reach 70% accuracy (log scale). Weight-based methods require ~200 MB vs. ~3.6 MB for logits-based training.
Figure 3: Byzantine robustness (CIFAR-10): accuracy vs. fraction of malicious clients. Median/trimmed-mean keep performance high up to ~30% attackers; mean fails early.
Figure 4: Calibration (ECE) on CIFAR-10: lower is better. Distillation and EMA improve calibration.
Figure 5: Fairness: std. dev. of per-client accuracies. Lower spread indicates more equitable performance.
...and 3 more figures
