Enabling Weak Client Participation via On-device Knowledge Distillation in Heterogeneous Federated Learning

Jihyun Lim, Junhyuk Jo, Tuo Zhang, Sunwoo Lee

TL;DR

The paper tackles weak client participation in heterogeneous Federated Learning (FL) by showing that server-side logit-ensemble KD degrades performance under non-IID data. It proposes a two-step on-device knowledge distillation protocol: all clients first train a small auxiliary model on their labeled local data, and strong clients then transfer its knowledge to a large target model via on-device KD on their unlabeled data. Empirical results on CIFAR-10, FEMNIST, CIFAR-100, IMDB, and Google Speech show higher accuracy than SOTA KD-based FL methods, while preserving data privacy and accommodating device heterogeneity. The work also provides a theoretical generalization bound showing that incorporating unlabeled local data via KD reduces the bound.

Abstract

Online Knowledge Distillation (KD) has recently been highlighted as a way to train large models in Federated Learning (FL) environments. Many existing studies adopt the logit ensemble method to perform KD on the server side. However, they often assume that unlabeled data collected at the edge is centralized on the server. Moreover, the logit ensemble method personalizes local models, which can degrade the quality of soft targets, especially when data is highly non-IID. To address these critical limitations, we propose a novel on-device KD-based heterogeneous FL method. Our approach leverages a small auxiliary model to learn from labeled local data. Subsequently, a subset of clients with strong system resources transfers knowledge to a large model through on-device KD using their unlabeled data. Our extensive experiments demonstrate that our on-device KD-based heterogeneous FL method effectively utilizes the system resources of all edge devices as well as the unlabeled data, resulting in higher accuracy compared to SOTA KD-based FL methods.
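
The on-device KD step described above can be illustrated with a minimal sketch of a distillation loss, in which the auxiliary (small) model acts as the teacher and the target (large) model as the student. This is an assumption-laden illustration, not the paper's implementation: the function names `log_softmax` and `kd_loss`, the temperature value, and the use of a plain KL divergence between softened distributions are hypothetical choices made here for clarity. The point it shows is that the loss depends only on the two models' logits, so it can be computed on unlabeled local data.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Temperature-scaled log-softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    log_total = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_total for z in scaled]


def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) between softened distributions.

    Hypothetical sketch: no ground-truth label is used, only the
    logits that the teacher and student produce for one sample.
    """
    log_p = log_softmax(teacher_logits, temperature)  # teacher (auxiliary model)
    log_q = log_softmax(student_logits, temperature)  # student (target model)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Usage: logits for one unlabeled sample from each model (made-up values).
aux_logits = [2.0, 0.5, -1.0]
target_logits = [1.0, 1.0, 0.0]
loss = kd_loss(aux_logits, target_logits)
```

In a training loop, a strong client would average this loss over a batch of its unlabeled samples and update only the target model; the loss is zero when both models produce the same softened distribution.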

Paper Structure

This paper contains 13 sections, 8 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: Non-IID distributions of CIFAR-10 (left-top) and Google Speech (left-bottom). Label-based Dirichlet distributions ($\alpha=0.1$) are used to get 100 local distributions. Due to limited space, we only show 20 of them. A small model trained via FL provides remarkably more accurate logits than the logit ensemble.
  • Figure 2: Schematic illustration of the proposed heterogeneous FL method. For simplicity, the server is omitted. An auxiliary (small) model is trained by all the devices using their private data. Then, the global knowledge is transferred to the target (large) model through on-device KD. This distributed knowledge transfer approach not only fully utilizes heterogeneous edge devices' system resources but also effectively uses the unlabeled private data for training.
  • Figure 3: The training loss (top) and validation accuracy (bottom) curves corresponding to Table \ref{tab:base}.
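
The label-based Dirichlet setup mentioned in the Figure 1 caption ($\alpha=0.1$, 100 local distributions) can be sketched as follows. This is a hedged illustration under stated assumptions: the caption does not specify the exact sampling procedure, so the sketch assumes each client independently draws its class proportions from a symmetric Dirichlet distribution, and the names `dirichlet_sample` and `make_label_distributions`, the seed, and the class count of 10 are hypothetical.

```python
import random


def dirichlet_sample(alpha, num_classes, rng):
    """Draw one sample from a symmetric Dirichlet via normalized Gamma draws."""
    total = 0.0
    while total == 0.0:  # guard against the (very unlikely) all-zero underflow
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_classes)]
        total = sum(draws)
    return [d / total for d in draws]


def make_label_distributions(num_clients=100, num_classes=10, alpha=0.1, seed=0):
    """Return one class-proportion vector per client (assumed procedure)."""
    rng = random.Random(seed)  # seeded for reproducibility
    return [dirichlet_sample(alpha, num_classes, rng) for _ in range(num_clients)]


# Usage: 100 local label distributions over 10 classes.
local_distributions = make_label_distributions()
```

A small $\alpha$ concentrates most of each client's probability mass on a few classes, which is what makes the resulting local datasets highly non-IID; larger values of $\alpha$ yield more uniform label distributions.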

Theorems & Definitions (1)

  • proof