Romanization Encoding For Multilingual ASR
Wen Ding, Fei Jia, Hainan Xu, Yu Xi, Junjie Lai, Boris Ginsburg
TL;DR
The paper addresses vocabulary-size and efficiency problems in multilingual and code-switching (CS) ASR for script-heavy languages by introducing romanization encoding. It combines a balanced concatenated tokenizer with a Roman2Char (R2C) decoder in a FastConformer-RNNT framework. This decouples acoustic modeling from language modeling and shrinks the vocabulary and output dimensions, which allows larger training batches and lower memory usage. On Mandarin-English CS (SEAME), the method cuts the vocabulary by about 63.5% and yields performance gains of about 13.7% and 15.0% in MER on the SEAME benchmarks, with additional gains when monolingual data is used. Ablations on Mandarin-Korean and Mandarin-Japanese indicate that the approach transfers to other script-heavy languages. The work also points to practical benefits such as faster RNNT inference, and to future improvements through BPE on romanized text and more advanced R2C or LLM-based decoding.
Abstract
We introduce romanization encoding for script-heavy languages to optimize multilingual and code-switching Automatic Speech Recognition (ASR) systems. By adopting romanization encoding alongside a balanced concatenated tokenizer within a FastConformer-RNNT framework equipped with a Roman2Char module, we significantly reduce vocabulary and output dimensions, enabling larger training batches and reduced memory consumption. Our method decouples acoustic modeling and language modeling, enhancing the flexibility and adaptability of the system. In our study, applying this method to Mandarin-English ASR resulted in a remarkable 63.51% vocabulary reduction and notable performance gains of 13.72% and 15.03% on SEAME code-switching benchmarks. Ablation studies on Mandarin-Korean and Mandarin-Japanese highlight our method's strong capability to address the complexities of other script-heavy languages, paving the way for more versatile and effective multilingual ASR systems.
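To make the vocabulary argument concrete, the following is a minimal sketch of the idea behind romanization encoding, not the paper's tokenizer or its Roman2Char module. It assumes a tiny hand-written lexicon of toned pinyin syllables (`CHAR_TO_ROMAN`, hypothetical) and hypothetical helper names. It shows how homophonous characters collapse onto one romanized token, which shrinks the output vocabulary, and why a separate step is then needed to map romanized tokens back to characters.

```python
# Toy illustration only: the lexicon and function names are assumptions,
# not taken from the paper. Pinyin syllables carry a tone digit.
CHAR_TO_ROMAN = {
    "他": "ta1", "她": "ta1", "它": "ta1",
    "是": "shi4", "事": "shi4", "市": "shi4",
    "我": "wo3", "你": "ni3",
}


def romanize(text):
    """Map each character to its romanized token; unknown symbols pass through."""
    return [CHAR_TO_ROMAN.get(ch, ch) for ch in text]


def vocab_reduction(mapping):
    """Return (character vocab size, romanized vocab size, relative reduction)."""
    n_chars = len(mapping)
    n_roman = len(set(mapping.values()))
    return n_chars, n_roman, 1 - n_roman / n_chars


def candidates(mapping):
    """Invert the lexicon: each romanized token maps to all characters it may stand for."""
    inverse = {}
    for ch, roman in mapping.items():
        inverse.setdefault(roman, []).append(ch)
    return inverse


if __name__ == "__main__":
    print(romanize("他是我"))
    print(vocab_reduction(CHAR_TO_ROMAN))
    print(candidates(CHAR_TO_ROMAN)["ta1"])
```

In this toy lexicon eight characters map to four romanized tokens, so the acoustic model would predict over a smaller output set. The inverse map is one-to-many, so context is required to pick the right character; in the paper that role is played by the learned Roman2Char module, for which the dictionary lookup here is only a stand-in.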
