Enabling Differentially Private Federated Learning for Speech Recognition: Benchmarks, Adaptive Optimizers and Gradient Clipping

Martin Pelikan; Sheikh Shams Azam; Vitaly Feldman; Jan "Honza" Silovsky; Kunal Talwar; Christopher G. Brinton; Tatiana Likhomanenko

TL;DR

This work tackles differentially private federated learning for end-to-end speech recognition with large transformer models. It combines per-layer clipping with layer-wise gradient normalization and uses the LAMB optimizer to mitigate gradient heterogeneity across the layers of deep networks, and it provides a theoretical convergence bound that accounts for DP noise, clipping bias, and data heterogeneity. Empirically, it establishes benchmarks on LibriSpeech and Common Voice (English, French, German), showing that user-level DP is viable with millions of users at a modest cost in word error rate (WER): 1.3% absolute WER degradation for $(7.2, 10^{-9})$-DP when extrapolating to a high population scale, and 4.6% for $(4.5, 10^{-9})$-DP at a low population scale. The findings indicate that addressing gradient heterogeneity through layer-wise intervention is key to scalable privacy-preserving FL for large models, and the authors suggest the principles apply beyond ASR. Overall, the paper offers a practical recipe for DP-FL with large transformer models, along with benchmarks that can inform practitioners about DP-FL performance in speech domains.
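To make the layer-wise normalization idea concrete, here is a minimal NumPy sketch of one LAMB step for a single layer, following the generic published form of the optimizer. It is an illustration under assumptions, not the paper's implementation: the function name `lamb_step`, the hyperparameter defaults, and the choice of the identity for the norm-scaling function are all hypothetical, and this page does not say whether the paper applies LAMB centrally or locally.

```python
import numpy as np

def lamb_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-6, weight_decay=0.0):
    """One LAMB update for a single layer (hypothetical helper).

    w: layer weights, g: gradient (or aggregated update) for this layer,
    m, v: first and second moment estimates, t: 1-based step counter.
    """
    # Adam-style moment estimates with bias correction.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    # Per-coordinate normalized direction plus optional weight decay.
    r = m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w

    # Layer-wise trust ratio: rescale the step so its norm tracks the
    # weight norm of this layer, regardless of the gradient's raw scale.
    w_norm = np.linalg.norm(w)
    r_norm = np.linalg.norm(r)
    trust = w_norm / r_norm if w_norm > 0 and r_norm > 0 else 1.0

    return w - lr * trust * r, m, v
```

Because the trust ratio is computed per layer, two layers whose gradients differ by orders of magnitude still take steps of comparable relative size, which is the property the summary credits with mitigating gradient heterogeneity across layers.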

Abstract

While federated learning (FL) and differential privacy (DP) have been extensively studied, their application to automatic speech recognition (ASR) remains largely unexplored due to the challenges in training large transformer models. Specifically, large models further exacerbate issues in FL as they are particularly susceptible to gradient heterogeneity across layers, unlike the relatively uniform gradient behavior observed in shallow models. As a result, prior works struggle to converge with standard optimization techniques, even in the absence of DP mechanisms. To the best of our knowledge, no existing work establishes a competitive, practical recipe for FL with DP in the context of ASR. To address this gap, we establish the first benchmark for FL with DP in end-to-end ASR. Our approach centers on per-layer clipping and layer-wise gradient normalization: theoretical analysis reveals that these techniques together mitigate clipping bias and gradient heterogeneity across layers in deeper models. Consistent with these theoretical insights, our empirical results show that FL with DP is viable under strong privacy guarantees, provided a population of at least several million users. Specifically, we achieve user-level (7.2, $10^{-9}$)-DP (resp. (4.5, $10^{-9}$)-DP) with only a 1.3% (resp. 4.6%) absolute drop in word error rate when extrapolating to high (resp. low) population scales for FL with DP in ASR. Although our experiments focus on ASR, the underlying principles we uncover, particularly those concerning gradient heterogeneity and layer-wise gradient normalization, offer broader guidance for designing scalable, privacy-preserving FL algorithms for large models across domains. Code of all experiments and benchmarks is available at https://github.com/apple/ml-pfl4asr.
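The per-layer clipping step described in the abstract can be sketched roughly as follows. This is an illustrative sketch under assumptions, not the code from the linked repository: the names `clip_per_layer` and `dp_aggregate` are hypothetical, the overall clipping budget is split uniformly across layers (one common choice, not confirmed by this page), and `noise_std` is assumed to be the standard deviation of the Gaussian noise added to the summed updates.

```python
import numpy as np

def clip_per_layer(update, layer_clip):
    """Scale each layer of one user's update to L2 norm <= layer_clip."""
    clipped = {}
    for name, delta in update.items():
        norm = np.linalg.norm(delta)
        scale = min(1.0, layer_clip / norm) if norm > 0 else 1.0
        clipped[name] = delta * scale
    return clipped

def dp_aggregate(user_updates, total_clip, noise_std, rng):
    """Average per-layer-clipped user updates with Gaussian noise added.

    user_updates: list of dicts mapping layer name -> np.ndarray.
    """
    num_layers = len(user_updates[0])
    # Assumption: uniform split, so the whole clipped update has
    # L2 norm <= sqrt(num_layers * layer_clip**2) = total_clip.
    layer_clip = total_clip / np.sqrt(num_layers)
    clipped = [clip_per_layer(u, layer_clip) for u in user_updates]

    aggregated = {}
    for name in user_updates[0]:
        total = sum(c[name] for c in clipped)
        noise = rng.normal(0.0, noise_std, size=total.shape)
        aggregated[name] = (total + noise) / len(user_updates)
    return aggregated
```

With a single global clipping norm, one layer with very large updates can force every other layer to be scaled down by the same factor. Clipping each layer against its own bound leaves small-norm layers untouched, which is one way to read the abstract's claim that per-layer clipping reduces clipping bias under gradient heterogeneity.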

Paper Structure

This paper contains 81 sections, 8 theorems, 79 equations, 25 figures, 22 tables, 1 algorithm.

Introduction
Federated Learning with Differential Privacy: Background and Notation
Federated Learning (FL)
FL with Differential Privacy (DP)
Theoretical Analysis: Adaptive Optimizers and Per-Layer Clipping
LAMB Optimizer.
Per-Layer Clipping.
Assumptions.
Interpreting the Bounds.
Recovering Prior Bounds.
Impact of Gradient Heterogeneity across Batches and Clients.
Trade-offs Between Clipping Constant and DP noise.
Benefits of Per-Layer Intervention.
Empirical Analysis
Data
...and 66 more sections

Key Result

Theorem 1

For the DP mechanism in Algorithm 1, the moments accountant of the sampled Gaussian mechanism correctly computes privacy loss with the noise scale of $z=\sigma_{\mathrm{DP}}/\mathbb{S}$ and central steps $T$, where $\mathbb{S}=1/(qK)$ and noise $\sigma_{\mathrm{DP}}$, probability of user s…
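The relation between the noise scale $z$ and the noise $\sigma_{\mathrm{DP}}$ in Theorem 1 can be worked through numerically. The sketch below assumes that $q$ is the per-round user sampling probability and $K$ is the population size; the truncated statement above suggests this reading but does not confirm it. The function names are hypothetical, and the clipping bound is assumed to be absorbed into the units of $\sigma_{\mathrm{DP}}$.

```python
def noise_multiplier(sigma_dp, q, population):
    """Noise scale z = sigma_DP / S with S = 1 / (q * K)."""
    sensitivity = 1.0 / (q * population)
    return sigma_dp / sensitivity

def required_sigma(z, q, population):
    """Invert the relation: sigma_DP needed to reach noise scale z."""
    return z / (q * population)

# Illustrative values only (not taken from the paper).
q = 0.001
for population in (100_000, 1_000_000, 10_000_000):
    sigma = required_sigma(1.0, q, population)
    print(f"K={population:>10,d}  expected cohort={q * population:>8,.0f}  "
          f"sigma_DP for z=1: {sigma:.2e}")
```

Under this reading, $z = \sigma_{\mathrm{DP}} \cdot qK$, so for a fixed target noise scale the noise added to the averaged update shrinks in proportion to the expected cohort size $qK$. This is consistent with the abstract's observation that strong guarantees become practical once the population reaches several million users.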

Figures (25)

Figure 1: $(\varepsilon, \delta)$-DP guarantees: central seed trained on LibriSpeech (100h) and fine-tuned with federated learning and differential privacy on Common Voice (1,500h) shows practical quality while preserving $(\varepsilon, \delta)$-DP for extrapolation to larger population and cohort size.
Figure 2: Train distribution in LS and CV: per speaker #minutes (top) and #samples (bottom).
Figure 3: Impact of the cohort size $S$ and seed models on FL models trained on LS. We use exponential decay for central LR starting at $t=1,000$, decay rate $0.6$, and transition steps $500$ (w/o seed model) or $250$ (w/ seed model) with $T=2$k total central steps and $10$ local epochs. Local (central) LR is 0.4 (0.006) (w/o seed model) or 0.2 (0.003) (w/ seed model). See details in Appendix \ref{app:fl-details-en}, Table \ref{table-ls-seed-results}.
Figure 4: Impact of the cohort size $S$ and seed models on FL models trained on CV: English (left) and French/German (right). We use exponential decay for central LR starting at $t=1,000$ (w/o seed model) or $750$ (w/ seed model), decay rate $0.6$, and transition steps $500$ (w/o seed model) or $750$ (w/ seed model) with $T=2$k total central steps and $10$ local epochs. Local (central) LR is 0.4 (0.006) (w/o seed model) or 0.2 (0.002) (w/ seed model). See details in Appendix \ref{app:fl-details-en}, Tables \ref{table-cv-seed-results} and \ref{table-compare-languages}.
Figure 5: Impact of randomizing the distribution of data across users for LS (left, middle) and CV (right) measured by WER. Parameter settings are described in Figure 3 for LS and Figure 4 for CV. While the original training data are non-IID (solid), IID (dashed) versions of LS-960, LS-860 and CV-en-train are created by choosing a user id uniformly at random from the set of user ids for each data point in the corresponding dataset. Detailed numbers are in Appendix \ref{app:fl-details-en}, Tables \ref{table-iid-ls} and \ref{table-iid-cv}.
...and 20 more figures

Theorems & Definitions (17)

Definition 1
Definition 2
Theorem 1
Definition 3
Corollary 1
Lemma 1
proof
Lemma 2
proof
Lemma 3
...and 7 more
