An Effective Training Framework for Light-Weight Automatic Speech Recognition Models

Abdul Hannan, Alessio Brutti, Shah Nawaz, Mubashir Noman

TL;DR

The paper addresses the challenge of deploying ASR on resource-constrained devices by introducing a two-step representation-learning framework that derives multiple small models from a single large reference model. In the first stage (EncRL), the large model is frozen and a lighter encoder is trained to align its representations and outputs with those of the reference model via CLIP-style and MSE losses. In the second stage, a CTC decoder is attached and the model is finetuned for a limited number of epochs. Experiments on LibriSpeech and TED-LIUM show that the approach achieves about a three-fold training speed-up and maintains or improves WER relative to small models trained from scratch or pruning baselines, with the combination of CLIP and MSE losses performing best. The framework is also robust across model sizes and substantially outperforms pruning-based baselines, which makes it practical for deploying efficient ASR on devices with restricted compute and memory budgets.

Abstract

Recent advances in deep learning have encouraged the development of large automatic speech recognition (ASR) models that achieve promising results while ignoring computational and memory constraints. However, deploying such models on low-resource devices is impractical despite their favorable performance. Existing approaches (pruning, distillation, layer skipping, etc.) transform large models into smaller ones at the cost of significant performance degradation, or require prolonged training of the smaller models to reach better performance. To address these issues, we introduce an effective two-step representation-learning approach capable of producing several small models from a single large model while ensuring considerably better performance in a limited number of epochs. Comprehensive experiments on ASR benchmarks demonstrate the efficacy of our approach, which achieves a three-fold training speed-up and up to 12.54% word error rate improvement.

Paper Structure

This paper contains 12 sections, 5 equations, 1 figure, 4 tables.

Figures (1)

  • Figure 1: Overall framework of the proposed method. (i) In the feature-learning phase, the input utterance is fed to a large ASR model (the reference model) to extract the features $e_{ref}$ and $b_{ref}$. The same input is then passed through the smaller model to obtain the features $e_{LW}$ and $b_{LW}$. To transfer the knowledge of the reference model, an MSE loss is applied between the classifier outputs of the reference and small models ($b_{ref}$ and $b_{LW}$), and a symmetric cross-entropy loss is applied to the features $e_{ref}^{N}$ and $e_{LW}^{M}$ to align the feature spaces. (ii) After the knowledge has been transferred to the encoder of the light-weight model, a CTC decoder is attached and the model is finetuned to transcribe the input speech.
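
The two losses named in the Figure 1 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes that the encoder features are pooled to one vector per utterance, that both models' features share the same dimension (for example after a projection), that the temperature is 0.07, and that the two losses are summed with a weight of 1.0. None of these choices is stated in the text above, and all function and parameter names are hypothetical.

```python
import numpy as np


def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def clip_style_loss(e_ref, e_lw, temperature=0.07):
    """Symmetric cross-entropy over the batch similarity matrix.

    e_ref, e_lw: (batch, dim) utterance-level features from the reference
    and light-weight encoders. Pooling to one vector per utterance and the
    temperature value are assumptions of this sketch.
    """
    logits = l2_normalize(e_ref) @ l2_normalize(e_lw).T / temperature
    idx = np.arange(logits.shape[0])
    # Matching pairs sit on the diagonal; classify them in both directions.
    loss_ref_to_lw = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_lw_to_ref = -log_softmax(logits, axis=0)[idx, idx].mean()
    return float(0.5 * (loss_ref_to_lw + loss_lw_to_ref))


def mse_loss(b_ref, b_lw):
    """Mean squared error between the classifier outputs of the two models."""
    return float(np.mean((b_ref - b_lw) ** 2))


def encrl_loss(e_ref, e_lw, b_ref, b_lw, mse_weight=1.0):
    """Combined objective; the equal weighting is an assumption."""
    return clip_style_loss(e_ref, e_lw) + mse_weight * mse_loss(b_ref, b_lw)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    e_ref = rng.normal(size=(4, 8))      # stand-in for reference encoder features
    b_ref = rng.normal(size=(4, 10, 5))  # stand-in for reference classifier outputs
    e_lw = e_ref + 0.1 * rng.normal(size=e_ref.shape)
    b_lw = b_ref + 0.1 * rng.normal(size=b_ref.shape)
    print("EncRL loss on nearly aligned features:", encrl_loss(e_ref, e_lw, b_ref, b_lw))
```

In an actual training loop the reference features would be computed without gradients, since the reference model is frozen, and only the light-weight encoder would be updated.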