Recurrent Early Exits for Federated Learning with Heterogeneous Clients

Royson Lee; Javier Fernandez-Marques; Shell Xu Hu; Da Li; Stefanos Laskaridis; Łukasz Dudziak; Timothy Hospedales; Ferenc Huszár; Nicholas D. Lane

TL;DR

This paper tackles device heterogeneity in federated learning (FL) by introducing ReeFL, a transformer-based recurrent early-exit module that is shared across sub-models and fuses multi-depth features into a single classifier. ReeFL enables per-client adaptive knowledge distillation by selecting the best-performing exit as the teacher, and it modulates backbone features to improve deeper predictions. It is trained end-to-end with a unified objective that combines cross-entropy losses with a KL knowledge-transfer term. Empirically, ReeFL outperforms depth- and width-based baselines (DepthFL, InclusiveFL, ScaleFL, ExclusiveFL) on CIFAR-100, FEMNIST, and SpeechCommands, for both 4 and 12 exits, while maintaining reasonable communication and compute costs under both parameter-efficient fine-tuning (PEFT) and full fine-tuning regimes. The work attributes these gains to feature fusion, dynamic teacher selection, and the shared classifier architecture, offering practical benefits for scalable FL on heterogeneous edge devices. It also provides extensive ablations on aggregation, distillation, and feature modulation, highlighting when ReeFL's components contribute most to accuracy.

Abstract

Federated learning (FL) has enabled distributed learning of a model across multiple clients in a privacy-preserving manner. One of the main challenges of FL is to accommodate clients with varying hardware capacities; clients have differing compute and memory requirements. To tackle this challenge, recent state-of-the-art approaches use early exits. Nonetheless, these approaches fall short of mitigating the challenges of jointly learning multiple exit classifiers, often relying on hand-picked heuristic solutions for knowledge distillation among classifiers and/or utilizing additional layers for weaker classifiers. In this work, instead of utilizing multiple classifiers, we propose a recurrent early exit approach named ReeFL that fuses features from different sub-models into a single shared classifier. Specifically, we use a transformer-based early-exit module shared among sub-models to i) better exploit multi-layer feature representations for task-specific prediction and ii) modulate the feature representation of the backbone model for subsequent predictions. We additionally present a per-client self-distillation approach where the best sub-model is automatically selected as the teacher of the other sub-models at each client. Our experiments on standard image and speech classification benchmarks across various emerging federated fine-tuning baselines demonstrate ReeFL's effectiveness over previous works.
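The per-client self-distillation idea can be illustrated with a small, hedged sketch. It is not the authors' implementation: the function names, the `kd_weight` and `temperature` parameters, the use of lowest mean cross-entropy as the meaning of "best", and the direction of the KL term (teacher to student) are all assumptions made for illustration. It only shows the general shape of an objective that sums cross-entropy over exits and adds a KL term from an automatically chosen teacher exit.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Cross-entropy of one sample against an integer class label."""
    return -math.log(softmax(logits)[label])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def self_distillation_loss(exit_logits, labels, kd_weight=1.0, temperature=1.0):
    """exit_logits[e][i] holds the logits of exit e for sample i.

    Returns (total_loss, teacher_index). All hyperparameters are hypothetical.
    """
    n = len(labels)
    # Mean cross-entropy of every exit on this client's batch.
    ce = [sum(cross_entropy(l, y) for l, y in zip(logits, labels)) / n
          for logits in exit_logits]
    # Assumption: the "best" exit is the one with the lowest cross-entropy.
    teacher = min(range(len(ce)), key=ce.__getitem__)
    kd = 0.0
    for e, logits in enumerate(exit_logits):
        if e == teacher:
            continue  # the teacher does not distill into itself
        for t_logits, s_logits in zip(exit_logits[teacher], logits):
            p = softmax(t_logits, temperature)  # teacher distribution
            q = softmax(s_logits, temperature)  # student distribution
            kd += kl_divergence(p, q) / n
    return sum(ce) + kd_weight * kd, teacher
```

For example, `self_distillation_loss(exit_logits, labels)` on a client batch returns the scalar loss together with the index of the exit that acted as teacher. In an autodiff framework the teacher's probabilities would normally be treated as constants; that detail is omitted here.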

Paper Structure (22 sections, 7 equations, 11 figures, 11 tables)

Introduction
Related Work
Proposed Method
Preliminaries
Recurrent Early Exits
Evaluation
Experimental Setup
Datasets
Model & Client Heterogeneity
Baselines
Hyperparameters
Comparison with Baselines
Ablation
Conclusion
Training & Implementation Details
...and 7 more sections

Figures (11)

Figure 1: Overview of ReeFL. (a) Early exiting of block $l$: Ree takes as input the meta-class token $z_{\text{meta}}$, the history of class tokens $[z_{\text{cls}}^1, \ldots, z_{\text{cls}}^{l-1}]$, and the most recent class token $z_{\text{cls}}^l$, and produces two tokens: 1) the modulated meta-class token $m_0^l$, which participates in early-exit classification, and 2) the modulated latest class token $m_l^l$, which replaces $z_{\text{cls}}^l$ as part of the input to block $l+1$. Assuming there is an early exit after every block, the forward pass runs the shown architecture $L$ times with a shared Ree module. We assume $m_0^0 \equiv z_{\text{cls}}^0$ to be the starting point. (b) We visualize Ree's feature modulation during the forward pass of a CIFAR-100 image by showing a sequence of attention maps. Starting from block $l=1$, we alternately show the attention map between $z_{\text{cls}}^l$ and $z_{1:n}^l$ (in blue) and the attention map between $m_l^l$ and $z_{1:n}^l$ (in pink). In this particular example, an image of a baby, the distinctive feature is the face, as learnt by later layers. In the earlier layers, particularly the 2nd, 3rd, and 4th, the modulated class token (pink) helps the backbone model focus more on the distinguishing parts of the image than the unmodulated class token (blue) does. The figure hence offers some interpretability as to how Ree's feature modulation affects the self-attention module in the backbone model, especially in the earlier layers.
Figure 2: Mean accuracy of each exit across 3 runs on SpeechCommands. More results can be found in the Appendix.
Figure 3: Impact of ReeFL's proposed knowledge distillation on FEMNIST ($4$ exits). See Appendix for more results.
Figure 4: Quantifying training costs for each exit for CIFAR-100, $\alpha=1.0$ for 12 exits, where each dot along each line represents an exit. Similar results on other datasets and scenarios can be found in the Appendix.
Figure 5: Quantifying training costs for each exit for CIFAR-100, $\alpha=1000$ for 12 exits, where each dot along each line represents an exit.
...and 6 more figures
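The recurrence described in the Figure 1 caption can be sketched as a toy forward pass. This is an illustration under strong assumptions, not the paper's model: `attend`, `ree`, and `forward_with_exits` are hypothetical names, the Ree module is replaced by weight-free dot-product attention, and each backbone block is reduced to a function of the class token alone (real blocks also process patch tokens $z_{1:n}^l$). Storing the unmodulated $z_{\text{cls}}^l$ in the history is one reading of the caption.

```python
import math

def attend(query, tokens):
    """Single-head dot-product attention of one query over a token list."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, t)) / math.sqrt(d) for t in tokens]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w / total * t[j] for w, t in zip(weights, tokens)) for j in range(d)]

def ree(z_meta, history, z_cls):
    """Toy stand-in for the shared Ree module (no learned weights).

    Input sequence: [z_meta, z_cls^1 .. z_cls^{l-1}, z_cls^l].
    Returns (m_0, m_l): modulated meta token and modulated latest class token.
    """
    tokens = [z_meta] + history + [z_cls]
    return attend(z_meta, tokens), attend(z_cls, tokens)

def forward_with_exits(z_cls0, z_meta, blocks, classifier):
    """Run every block, exiting after each with the same Ree and classifier."""
    history, outputs = [], []
    cls_in = z_cls0
    for block in blocks:
        z_cls = block(cls_in)                   # class token produced by block l
        m_0, m_l = ree(z_meta, history, z_cls)  # one shared module for all exits
        outputs.append(classifier(m_0))         # early-exit prediction from m_0^l
        history.append(z_cls)                   # grow the class-token history
        cls_in = m_l                            # m_l^l replaces z_cls^l for block l+1
    return outputs
```

Calling `forward_with_exits` with a list of $L$ block functions yields $L$ predictions, one per exit. A client with a smaller budget would simply pass a shorter prefix of `blocks`, while the Ree module and classifier stay the same across depths.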
