Understanding Zero-shot Rare Word Recognition Improvements Through LLM Integration
Haoxuan Wang
TL;DR
This work tackles the persistent challenge of rare word recognition in ASR by integrating a large language model (LLM) into an encoder–decoder framework. Using a 190k-hour YouTube-derived dataset with Whisper V3 pseudo-labeling, the authors couple Whisper V2 as the speech encoder, an adapter for aligning acoustic features to the LLM, and Qwen-7B-Chat as the decoder (via LoRA fine-tuning). The results show that the LLM–ASR architecture yields marked improvements in rare-word recognition (R-WER) while maintaining competitive general transcription metrics (O-WER, N-WER) compared to a Zipformer Transducer baseline. Data quality and the adapter both play critical roles in achieving these gains. These findings highlight the potential of large-scale LLM-based speech recognition systems to improve transcription accuracy for long-tail vocabulary, with implications for robustness and downstream applications, and point to future work on alignment, latency reduction, and domain adaptation.
Abstract
In this study, we investigate the integration of a large language model (LLM) with an automatic speech recognition (ASR) system, focusing on rare word recognition. Using a 190,000-hour dataset primarily sourced from YouTube and pseudo-labeled with Whisper V3, we demonstrate that, when trained at this scale, the LLM-ASR architecture outperforms traditional Zipformer-Transducer models on the zero-shot rare word recognition task. Our analysis reveals that the LLM contributes significantly to improvements in rare word error rate (R-WER), while the speech encoder primarily determines overall transcription performance, measured by orthographic word error rate (O-WER) and normalized word error rate (N-WER). Through extensive ablation studies, we highlight the importance of the adapter in aligning speech encoder outputs with the LLM's linguistic capabilities. Furthermore, we emphasize the critical role of high-quality labeled data in achieving optimal performance. These findings provide valuable insights into the synergy between the components of LLM-based ASR architectures, paving the way for future advancements in large-scale LLM-based speech recognition systems.
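
To make the three metrics named in the abstract concrete, here is a minimal Python sketch of how they might be computed. It is not the authors' scoring code, and this section does not give exact definitions, so the following are assumptions: O-WER scores the raw tokens (case and punctuation count), N-WER scores tokens after a simple normalizer, and R-WER counts substitutions and deletions only on reference words that belong to a given rare-word set. The function names, the normalizer, and the example sentence are all hypothetical.

```python
import re

def normalize(text):
    """Lowercase and strip punctuation (a simplified stand-in for a real ASR normalizer)."""
    return re.sub(r"[^\w\s']", "", text.lower()).split()

def align(ref, hyp):
    """Levenshtein alignment over words; returns a list of (op, ref_word, hyp_word)."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + cost, d[i - 1][j] + 1, d[i][j - 1] + 1)
    # Backtrace from the bottom-right corner to recover one optimal alignment.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            ops.append(("ok" if ref[i - 1] == hyp[j - 1] else "sub", ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("del", ref[i - 1], None))
            i -= 1
        else:
            ops.append(("ins", None, hyp[j - 1]))
            j -= 1
    return ops[::-1]

def wer(ref, hyp):
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    errors = sum(op != "ok" for op, _, _ in align(ref, hyp))
    return errors / max(len(ref), 1)

def rare_word_error_rate(ref, hyp, rare_words):
    """Assumed R-WER: share of rare reference words that were substituted or deleted."""
    rare_total = sum(1 for w in ref if w in rare_words)
    rare_errors = sum(
        1 for op, r, _ in align(ref, hyp) if r in rare_words and op in ("sub", "del")
    )
    return rare_errors / rare_total if rare_total else 0.0

# Hypothetical example: the rare word is mis-transcribed as two common-sounding tokens.
ref_text = "The patient takes Xeljanz daily."
hyp_text = "the patient takes zel jans daily"

o_wer = wer(ref_text.split(), hyp_text.split())        # raw tokens
n_wer = wer(normalize(ref_text), normalize(hyp_text))  # normalized tokens
r_wer = rare_word_error_rate(normalize(ref_text), normalize(hyp_text), {"xeljanz"})
print(o_wer, n_wer, r_wer)
```

In this toy case the orthographic score is penalized for casing and the trailing period, the normalized score reflects only the mis-recognized word, and the rare-word score isolates that word. That separation is why a model can look similar on overall metrics while differing sharply on long-tail vocabulary.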
