Federated Heterogeneous Language Model Optimization for Hybrid Automatic Speech Recognition

Mengze Hong, Yi Gu, Di Jiang, Hanlin Gu, Chen Jason Zhang, Lu Wang, Zhiyang Su

TL;DR

Experiments on seven OpenSLR datasets show RMMA achieves the lowest average Character Error Rate and better generalization than baselines, converging up to seven times faster than GMMA, highlighting the paradigm's potential for scalable, privacy-preserving ASR systems.

Abstract

Training automatic speech recognition (ASR) models increasingly relies on decentralized federated learning to ensure data privacy and accessibility, producing multiple local models that require effective merging. In hybrid ASR systems, while acoustic models can be merged using established methods, the language model (LM) for rescoring the N-best speech recognition list faces challenges due to the heterogeneity of non-neural n-gram models and neural network models. This paper proposes a heterogeneous LM optimization task and introduces a match-and-merge paradigm with two algorithms: the Genetic Match-and-Merge Algorithm (GMMA), using genetic operations to evolve and pair LMs, and the Reinforced Match-and-Merge Algorithm (RMMA), leveraging reinforcement learning for efficient convergence. Experiments on seven OpenSLR datasets show RMMA achieves the lowest average Character Error Rate and better generalization than baselines, converging up to seven times faster than GMMA, highlighting the paradigm's potential for scalable, privacy-preserving ASR systems.

Paper Structure

This paper contains 11 sections, 10 equations, 4 figures, 2 tables, 2 algorithms.

Figures (4)

  • Figure 1: The optimization of heterogeneous language model (LM) pairs in ASR involves (1) local training of source model pairs by curators using private data, and (2) merging multiple models to form a superior target model pair. The values of $m$ and $k$ may differ due to differences in data distribution across curators.
  • Figure 2: Overview of the Genetic Match-and-Merge Algorithm (GMMA). Heterogeneous language models (n-gram and neural LMs) are treated as separate populations and evolved via type-specific genetic operations. The top-$k$ candidates from each population are then paired, and LM combinations with the highest matching fitness are selected for reproduction and merging.
  • Figure 3: Overview of the Reinforced Match-and-Merge Algorithm (RMMA). The merging problem is formulated as a sequential decision-making process, where a reinforcement learning agent selects actions to maximize rewards and enhance model quality.
  • Figure 4: Convergence behavior on training data. GMMA is plotted against the secondary (right) axis because of its larger variance.
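
The matching step described in the Figure 2 caption (take the top-$k$ candidates from each population, pair them, keep the pairs with the highest matching fitness) can be illustrated with a small sketch. This is not the authors' implementation: the function names `top_k` and `match_pairs`, the parameter `n_select`, and all three fitness callables are hypothetical placeholders, since the captions do not specify how fitness is computed. The genetic operations and the merge itself are left out.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple, TypeVar

NGram = TypeVar("NGram")    # stands in for an n-gram LM candidate
Neural = TypeVar("Neural")  # stands in for a neural LM candidate
T = TypeVar("T")


def top_k(population: Sequence[T], fitness: Callable[[T], float], k: int) -> List[T]:
    """Return the k candidates with the highest individual fitness."""
    return sorted(population, key=fitness, reverse=True)[:k]


def match_pairs(
    ngram_pop: Sequence[NGram],
    neural_pop: Sequence[Neural],
    ngram_fitness: Callable[[NGram], float],
    neural_fitness: Callable[[Neural], float],
    pair_fitness: Callable[[NGram, Neural], float],
    k: int,
    n_select: int,
) -> List[Tuple[NGram, Neural]]:
    """Pair the top-k candidates of each population and keep the best pairs.

    All fitness functions are assumptions supplied by the caller; the source
    only states that pairs with the highest matching fitness are selected.
    """
    best_ngram = top_k(ngram_pop, ngram_fitness, k)
    best_neural = top_k(neural_pop, neural_fitness, k)
    # Every top n-gram candidate is tried with every top neural candidate.
    candidates = list(product(best_ngram, best_neural))
    candidates.sort(key=lambda pair: pair_fitness(*pair), reverse=True)
    return candidates[:n_select]
```

In practice the pair fitness would presumably come from rescoring a held-out N-best list with both LMs together, which is why it is passed in as a separate callable rather than derived from the two individual scores.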