Federating Dynamic Models using Early-Exit Architectures for Automatic Speech Recognition on Heterogeneous Clients

Mohamed Nabih Ali, Alessio Brutti, Daniele Falavigna

TL;DR

This work tackles the challenge of federated ASR on heterogeneous edge devices by introducing dynamic early-exit (EE) architectures that adapt processing depth to the input and to resource constraints. The authors prove that federating heterogeneous EE models is equivalent to training a single homogeneous EE model when the losses across exits are properly combined, and they derive a practical, exit-aware aggregation scheme for federated learning (FL). Empirically, on TED-LIUM-3 and VoxPopuli with pretraining on LibriSpeech, the EE-FL approach with FedAdam (and optional front-end freezing) achieves performance approaching centralized training, even under non-uniform and highly heterogeneous client distributions. The study demonstrates that partial training via EE not only reduces resource demands but also simplifies privacy-preserving deployment in FL, offering a scalable path for robust, domain-adaptive ASR across diverse devices.

Abstract

Automatic speech recognition models require large amounts of speech recordings for training. However, collecting such data is often cumbersome and raises privacy concerns. Federated learning has been widely used as an effective decentralized technique that collaboratively learns a shared prediction model while keeping the data local on different clients. Unfortunately, client devices often have limited computation and communication resources, which leads to practical difficulties for large models. In addition, the heterogeneity that characterizes edge devices makes it sub-optimal to generate a single model that fits all of them. Unlike the recent literature, where multiple models with different architectures are used, in this work we propose using dynamic architectures which, employing early-exit solutions, can adapt their processing (i.e., the traversed layers) depending on the input and on the operating conditions. This solution falls in the realm of partial training methods and brings two benefits: a single model is used on a variety of devices, and federating the models after local training is straightforward. Experiments on public datasets show that our proposed approach is effective and can be combined with basic federated learning strategies.
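
To make the "federating is straightforward" claim concrete, here is a minimal sketch of one way a server could merge early-exit models of different depths: each layer is averaged only over the clients that actually trained it. This is an illustrative assumption, not the authors' exact aggregation scheme. It uses a plain FedAvg-style sample-weighted mean rather than FedAdam, and the names `ClientUpdate` and `aggregate_by_layer` as well as the toy weights are hypothetical.

```python
from typing import Dict, List

# Hypothetical representation: each client update maps a layer index to a flat
# list of weights. A client that exits after k layers holds only layers 0..k-1.
ClientUpdate = Dict[int, List[float]]


def aggregate_by_layer(updates: List[ClientUpdate],
                       num_samples: List[int]) -> Dict[int, List[float]]:
    """Sample-weighted average of each layer over the clients that trained it.

    Assumes every sample count is positive and that a given layer has the
    same shape on every client that holds it.
    """
    if len(updates) != len(num_samples):
        raise ValueError("one sample count per client update is required")
    sums: Dict[int, List[float]] = {}
    totals: Dict[int, int] = {}
    for update, n in zip(updates, num_samples):
        for layer, weights in update.items():
            if layer not in sums:
                sums[layer] = [0.0] * len(weights)
                totals[layer] = 0
            acc = sums[layer]
            for i, w in enumerate(weights):
                acc[i] += n * w
            totals[layer] += n
    return {layer: [s / totals[layer] for s in acc]
            for layer, acc in sums.items()}


# Three toy clients holding 1, 2, and 3 layers (one weight per layer).
clients = [
    {0: [1.0]},
    {0: [2.0], 1: [4.0]},
    {0: [3.0], 1: [8.0], 2: [5.0]},
]
global_model = aggregate_by_layer(clients, num_samples=[10, 10, 20])
```

In this toy setup the lowest layer is averaged over all three clients, while the deepest layer comes only from the single client able to train it, so the aggregated result is itself a full-depth model.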

Paper Structure (15 sections, 11 equations, 7 figures, 3 tables)

Figures (7)

  • Figure 1: (a) Federated learning framework with N connected clients. (b) An early-exit model architecture deployed on different devices.
  • Figure 2: Illustration of the server-client aggregation strategy with heterogeneous models. (a) In common SOTA approaches, client models have different architectures and share a common part (azure nodes) with the global model [cho2022heterogeneous]. (b) Our proposed approach, where all clients share the same architecture (with a different number of layers according to their computational resources), while their central aggregation results in an EE model.
  • Figure 3: WER achieved on TED-LIUM-3 with homogeneous and heterogeneous devices, using FedAvg and FedAdam during the first 200 FL rounds. Each figure refers to one exit. Heterogeneous models are uniformly distributed.
  • Figure 4: Comparison of the WER achieved with FedAdam on TED-LIUM-3 with (pink and pale colors) and without (red and green colors) freezing the convolutional front-end.
  • Figure 5: Comparison of the WER achieved on VoxPopuli with freezing the convolutional front-end for ELF and OFL.
  • ...and 2 more figures