Building English ASR model with regional language support
Purvi Agrawal, Vikas Joshi, Bharati Patidar, Ankur Gupta, Rupesh Kumar Mehta
TL;DR
This paper tackles improving an English ASR system's handling of Hindi queries without sacrificing English accuracy. It introduces SplitHead with Attention (SHA), an acoustic model with shared hidden layers and language-specific heads fused by self-attention, paired with a Hinglish language model created by interpolating English text with transliterated Hindi text. Through staged training and teacher-student distillation, the SHA model achieves up to 69.3% relative WER reduction on Hindi and 5.7% on English compared to a monolingual baseline, while the Hinglish LM, particularly with contextual transliteration, further boosts Hindi performance. The work demonstrates a practical, cost-efficient approach to regional-language support in English ASR, with strong implications for real-world multilingual voice assistants and customer-service applications.
Abstract
In this paper, we present a novel approach to developing an English Automatic Speech Recognition (ASR) system that can effectively handle Hindi queries without compromising its performance on English. We propose an acoustic model (AM), referred to as the SplitHead with Attention (SHA) model, which features shared hidden layers across languages and language-specific projection layers combined via a self-attention mechanism. This mechanism estimates a weight for each language based on the input data and weights the corresponding language-specific projection layers accordingly. Additionally, we propose a language modeling approach that interpolates n-gram models trained on English and transliterated Hindi text corpora. Our results demonstrate the effectiveness of our approach, with relative word error rate reductions of 69.3% and 5.7% on the Hindi and English test sets, respectively, compared to a monolingual English model.
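The two ideas in the abstract, attention-weighted fusion of language-specific heads and interpolation of two n-gram models, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`sha_fuse`, `interpolate_ngram`), the softmax normalization of the attention scores, the use of linear interpolation with a single weight `lam`, and all numeric values are assumptions chosen for illustration. In the real model the attention scores would be computed from the input by the network; here they are simply passed in.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sha_fuse(head_outputs, attention_scores):
    """Fuse language-specific head outputs with per-language weights.

    head_outputs: dict mapping language -> list of floats (equal lengths).
    attention_scores: dict mapping language -> unnormalized score
        (hypothetical stand-in for what the attention layer would produce).
    Returns (weights per language, fused output vector).
    """
    langs = sorted(head_outputs)
    weights = softmax([attention_scores[lang] for lang in langs])
    dim = len(head_outputs[langs[0]])
    fused = [0.0] * dim
    for lang, weight in zip(langs, weights):
        for i, value in enumerate(head_outputs[lang]):
            fused[i] += weight * value
    return dict(zip(langs, weights)), fused


def interpolate_ngram(p_english, p_hindi_translit, lam):
    """Linearly interpolate two n-gram probabilities for the same event.

    lam is the weight on the English model (assumed to lie in [0, 1]).
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    return lam * p_english + (1.0 - lam) * p_hindi_translit


# Illustrative values only; none of these numbers come from the paper.
heads = {"en": [0.2, 0.8], "hi": [0.6, 0.4]}
scores = {"en": 0.0, "hi": 0.0}  # equal scores give equal weights
weights, fused = sha_fuse(heads, scores)
p_mixed = interpolate_ngram(p_english=0.01, p_hindi_translit=0.05, lam=0.7)
```

With equal scores each head contributes half of the fused output; raising the score for one language shifts the fused output toward that language's head, which is the behavior the abstract attributes to the attention mechanism.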
