An Effective Training Framework for Light-Weight Automatic Speech Recognition Models
Abdul Hannan, Alessio Brutti, Shah Nawaz, Mubashir Noman
TL;DR
The paper addresses the challenge of deploying ASR on resource-constrained devices by introducing a two-step representation-learning framework that derives multiple small models from a single large reference model. In the first stage, EncRL, the large model is frozen and a lighter encoder is trained to align its representations and outputs with those of the large model via CLIP-style and MSE losses. In the second stage, a CTC decoder is attached and the model is finetuned for a limited number of epochs. Experiments on LibriSpeech and TED-LIUM show that the approach achieves about a three-fold training speed-up and maintains or improves WER relative to training small models from scratch, with the combined CLIP+MSE loss performing best. The framework is also robust across model sizes and substantially outperforms pruning-based baselines, which highlights its practical value for deploying efficient ASR on devices with restricted compute and memory budgets.
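To make the EncRL objective more concrete, the following is a minimal sketch of how a CLIP-style loss and an MSE loss could be combined, not the authors' implementation. It assumes that "CLIP-style" means a symmetric contrastive cross-entropy over temperature-scaled cosine similarities between paired embeddings, that each utterance is summarised by one pooled embedding, and that the small and large encoders share an embedding dimension. The function names, the temperature of 0.07, and the `mse_weight` parameter are hypothetical. Pure Python is used to keep the sketch self-contained; a real training loop would use a tensor library.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_style_loss(student, teacher, temperature=0.07):
    """Symmetric contrastive loss: utterance i of the small encoder should
    match utterance i of the frozen large encoder, and vice versa.
    The temperature value is an assumption, not taken from the paper."""
    s = [l2_normalize(v) for v in student]
    t = [l2_normalize(v) for v in teacher]
    # Cosine similarity between every student/teacher pair in the batch.
    logits = [
        [sum(a * b for a, b in zip(si, tj)) / temperature for tj in t]
        for si in s
    ]
    logits_t = [list(col) for col in zip(*logits)]  # teacher-to-student view
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


def mse_loss(student, teacher):
    """Mean squared error over all embedding entries in the batch."""
    n = sum(len(v) for v in student)
    return sum(
        (a - b) ** 2
        for sv, tv in zip(student, teacher)
        for a, b in zip(sv, tv)
    ) / n


def encrl_loss(student, teacher, mse_weight=1.0, temperature=0.07):
    """Hypothetical combined objective: CLIP-style term plus weighted MSE."""
    return (clip_style_loss(student, teacher, temperature)
            + mse_weight * mse_loss(student, teacher))


if __name__ == "__main__":
    # Toy batch of three utterance embeddings from the frozen large model.
    teacher = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    aligned = [row[:] for row in teacher]   # small encoder matches exactly
    shuffled = teacher[1:] + teacher[:1]    # pairs are mismatched
    print("aligned loss :", encrl_loss(aligned, teacher))
    print("shuffled loss:", encrl_loss(shuffled, teacher))
```

In this toy batch the aligned embeddings yield a much lower loss than the mismatched ones, which is the behaviour an alignment objective of this kind is meant to reward. The contrastive term encourages each small-model embedding to be closest to its own large-model counterpart within the batch, while the MSE term pulls the values themselves together.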
Abstract
Recent advances in deep learning have encouraged the development of large automatic speech recognition (ASR) models that achieve promising results while ignoring computational and memory constraints. However, deploying such models on low-resource devices is impractical despite their favorable performance. Existing approaches (pruning, distillation, layer skipping, etc.) either transform large models into smaller ones at the cost of significant performance degradation or require prolonged training of the smaller models to reach better performance. To address these issues, we introduce an effective two-step approach based on representation learning that can produce several small models from a single large model while ensuring considerably better performance in a limited number of epochs. Comprehensive experiments on ASR benchmarks demonstrate the efficacy of our approach, which achieves a three-fold training speed-up and up to 12.54% word error rate improvement.
