Enabling Differentially Private Federated Learning for Speech Recognition: Benchmarks, Adaptive Optimizers and Gradient Clipping
Martin Pelikan, Sheikh Shams Azam, Vitaly Feldman, Jan "Honza" Silovsky, Kunal Talwar, Christopher G. Brinton, Tatiana Likhomanenko
TL;DR
This work tackles the challenge of enabling differentially private federated learning for end-to-end speech recognition using large transformer models. It introduces per-layer clipping combined with layer-wise gradient normalization and employs the LAMB optimizer to mitigate gradient heterogeneity across layers in deep networks, and it provides a theoretical convergence bound that accounts for DP noise, clipping bias, and data heterogeneity. Empirically, it establishes benchmarks on LibriSpeech and Common Voice languages, showing that user-level DP is viable with a population of several million users at a modest cost in word error rate (WER): when extrapolated, a 1.3% absolute drop at high population scale for (7.2, $10^{-9}$)-DP and a 4.6% absolute drop at low population scale for (4.5, $10^{-9}$)-DP. The findings indicate that addressing gradient heterogeneity through layer-wise intervention is key to scalable privacy-preserving FL for large models, and the authors suggest the same principles apply beyond ASR. Overall, the paper provides a practical recipe for privacy-preserving FL with large transformer models, along with benchmarks that practitioners can use to gauge DP-FL performance in speech domains.
Abstract
While federated learning (FL) and differential privacy (DP) have been extensively studied, their application to automatic speech recognition (ASR) remains largely unexplored due to the challenges in training large transformer models. Specifically, large models further exacerbate issues in FL as they are particularly susceptible to gradient heterogeneity across layers, unlike the relatively uniform gradient behavior observed in shallow models. As a result, prior works struggle to converge with standard optimization techniques, even in the absence of DP mechanisms. To the best of our knowledge, no existing work establishes a competitive, practical recipe for FL with DP in the context of ASR. To address this gap, we establish the first benchmark for FL with DP in end-to-end ASR. Our approach centers on per-layer clipping and layer-wise gradient normalization: theoretical analysis reveals that these techniques together mitigate clipping bias and gradient heterogeneity across layers in deeper models. Consistent with these theoretical insights, our empirical results show that FL with DP is viable under strong privacy guarantees, provided a population of at least several million users. Specifically, we achieve user-level (7.2, $10^{-9}$)-DP (resp. (4.5, $10^{-9}$)-DP) with only a 1.3% (resp. 4.6%) absolute drop in word error rate when extrapolating to high (resp. low) population scales for FL with DP in ASR. Although our experiments focus on ASR, the underlying principles we uncover, particularly those concerning gradient heterogeneity and layer-wise gradient normalization, offer broader guidance for designing scalable, privacy-preserving FL algorithms for large models across domains. Code for all experiments and benchmarks is available at https://github.com/apple/ml-pfl4asr.
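To make the idea of per-layer clipping concrete, the following is a minimal NumPy sketch, not the paper's implementation. It assumes each user's model update is a dictionary mapping layer names to arrays; the names `clip_per_layer`, `dp_aggregate`, `clip_norms`, and `noise_multiplier` are hypothetical. Each layer's update is clipped to its own L2 bound, the clipped updates are summed, and Gaussian noise is added with a standard deviation scaled to the combined sensitivity $\sqrt{\sum_l C_l^2}$. Layer-wise gradient normalization, the LAMB optimizer, and privacy accounting are deliberately left out.

```python
import numpy as np


def clip_per_layer(delta, clip_norms):
    """Scale each layer's update so its L2 norm is at most that layer's bound."""
    clipped = {}
    for name, update in delta.items():
        norm = np.linalg.norm(update)
        # Updates already within the bound are left unchanged (scale = 1).
        scale = min(1.0, clip_norms[name] / (norm + 1e-12))
        clipped[name] = update * scale
    return clipped


def dp_aggregate(user_deltas, clip_norms, noise_multiplier, rng):
    """Average per-layer-clipped user updates after adding Gaussian noise.

    Assumption: with per-layer bounds C_l, one user's concatenated update has
    L2 norm at most sqrt(sum_l C_l^2), which is used as the sensitivity.
    """
    n = len(user_deltas)
    sensitivity = np.sqrt(sum(c ** 2 for c in clip_norms.values()))
    sigma = noise_multiplier * sensitivity

    # Sum the clipped updates layer by layer.
    sums = {name: np.zeros(u.shape, dtype=float)
            for name, u in user_deltas[0].items()}
    for delta in user_deltas:
        for name, update in clip_per_layer(delta, clip_norms).items():
            sums[name] += update

    # Add noise to the sum, then divide by the number of users.
    return {name: (s + rng.normal(0.0, sigma, size=s.shape)) / n
            for name, s in sums.items()}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    deltas = [
        {"encoder": np.array([3.0, 4.0]), "decoder": np.array([0.1, 0.0])},
        {"encoder": np.array([0.0, 0.5]), "decoder": np.array([0.0, 2.0])},
    ]
    clip_norms = {"encoder": 1.0, "decoder": 1.0}
    print(dp_aggregate(deltas, clip_norms, noise_multiplier=1.0, rng=rng))
```

Clipping each layer separately keeps a layer with large gradients from dominating the clipping budget of the others, which is the heterogeneity problem the abstract describes. Because the noise is divided by the number of users, its effect on the average shrinks as the population grows, consistent with the abstract's requirement of several million users.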
