Parameter-Efficient Transfer Learning under Federated Learning for Automatic Speech Recognition

Xuan Kan, Yonghui Xiao, Tien-Ju Yang, Nanxin Chen, Rajiv Mathews

TL;DR

The paper tackles privacy-preserving automatic speech recognition (ASR) across diverse user domains by combining federated learning with parameter-efficient adapter-based domain adaptation. It presents a three-stage training pipeline and a detailed adapter design space for integrating adapters into a Conformer encoder under federated learning, aiming to reduce data and communication costs. Key findings show that federated adapter tuning can match the performance of centralized tuning while dramatically cutting updated parameters and communication, with parallel adapters generally delivering the best transfer and a manageable trade-off between adaptation and original-domain generalization. The work thereby enables practical, privacy-preserving, on-device ASR personalization across accent, dialect, and language variation, guiding future research in efficient federated domain adaptation.

Abstract

This work explores the challenge of enhancing Automatic Speech Recognition (ASR) model performance across various user-specific domains while preserving user data privacy. We employ federated learning and parameter-efficient domain adaptation methods to address (1) the massive data requirement of ASR models in user-specific scenarios and (2) the substantial communication cost between servers and clients during federated learning. We demonstrate that, when equipped with proper adapters, ASR models under federated tuning can achieve performance similar to that of centrally tuned models, thus providing a potential direction for future privacy-preserving ASR services. We also investigate the efficiency of different adapters and adapter incorporation strategies under the federated learning setting.
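
To make the communication argument concrete, the following is a minimal sketch of federated averaging restricted to adapter weights. It is not the paper's implementation: the parameter names, the toy sizes, the `local_update` stand-in for on-device training, and the use of a plain example-weighted average (FedAvg) are all assumptions made for illustration.

```python
# Illustrative sketch only: names, sizes, and the update rule are assumptions,
# not taken from the paper.

def split_trainable(params, adapter_prefix="adapter."):
    """Return only the parameters whose name marks them as adapter weights."""
    return {k: v for k, v in params.items() if k.startswith(adapter_prefix)}

def local_update(adapter, grads, lr=0.1):
    """Stand-in for on-device training: one SGD step on adapter weights only."""
    return {k: [w - lr * g for w, g in zip(adapter[k], grads[k])] for k in adapter}

def fed_avg(client_adapters, client_sizes):
    """Average client adapter weights, weighted by each client's example count."""
    total = sum(client_sizes)
    return {
        k: [
            sum(n * c[k][i] for c, n in zip(client_adapters, client_sizes)) / total
            for i in range(len(client_adapters[0][k]))
        ]
        for k in client_adapters[0]
    }

def communicated_fraction(params, adapter_prefix="adapter."):
    """Share of all parameters that must travel between server and clients."""
    total = sum(len(v) for v in params.values())
    sent = sum(len(v) for v in split_trainable(params, adapter_prefix).values())
    return sent / total

# Toy model: a large frozen encoder and a small adapter (hypothetical sizes).
model = {
    "encoder.layer0": [0.0] * 96,
    "adapter.down": [1.0, 1.0],
    "adapter.up": [2.0, 2.0],
}

adapter = split_trainable(model)  # the server sends only these weights
client_a = local_update(adapter, {"adapter.down": [1.0, 1.0], "adapter.up": [0.0, 0.0]})
client_b = local_update(adapter, {"adapter.down": [-1.0, -1.0], "adapter.up": [2.0, 2.0]})

# The server aggregates adapter weights; the frozen encoder never leaves it.
new_adapter = fed_avg([client_a, client_b], client_sizes=[1, 3])
model.update(new_adapter)
fraction = communicated_fraction(model)
```

In this toy setup only the adapter entries are exchanged each round, so the communicated share of the model is the ratio of adapter parameters to total parameters. The real ratio depends on the adapter design and the encoder size reported in the paper.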

Paper Structure

This paper contains 9 sections, 1 equation, 4 figures, and 2 tables.

Figures (4)

  • Figure 1: The pipeline incorporates 3 stages with 2 strategies.
  • Figure 2: The structure of Adapter, Conformer Layer, and Conformer Layer w/ various Adapters.
  • Figure 3: The model family used in the experiments. Starting from a pre-trained encoder in the first stage, we derive 6 different pre-trained models in the second stage by integrating various adapters (performance is summarized in the pre-trained model results table, tab:pt), and ultimately obtain 10 models with unique pre-trained bases or adapters during federated tuning (performance is summarized in the federated tuning results table, tab:federated; the model indices in the figure match the index column of that table).
  • Figure 4: The WER change when tuning the PT w/o Adapter model on the Fleurs EN-US dataset. Each column represents a model equipped with a different adapter; the first row shows the increase in WER on the LibriSpeech dataset after tuning, and the second row shows the corresponding decrease in WER on the Fleurs EN-US dataset.