Romanization Encoding For Multilingual ASR

Wen Ding; Fei Jia; Hainan Xu; Yu Xi; Junjie Lai; Boris Ginsburg

TL;DR

The paper addresses vocabulary-size and efficiency problems in multilingual and code-switching (CS) ASR for script-heavy languages by introducing romanization encoding. It combines a balanced concatenated tokenizer with a Roman2Char (R2C) decoder in a FastConformer-RNNT framework, decoupling acoustic modeling from language modeling and allowing larger training batches with lower memory use. On Mandarin-English CS (SEAME), the method reduces the vocabulary by 63.51% and yields MER improvements of 13.72% and 15.03% on the SEAME benchmarks, with additional gains when monolingual data is used; ablations on Mandarin-Korean and Mandarin-Japanese indicate that the approach carries over to other script-heavy languages. The paper also notes practical benefits such as faster RNNT inference, and points to future work on BPE over romanized text and on more advanced R2C or LLM-based decoding.

Abstract

We introduce romanization encoding for script-heavy languages to optimize multilingual and code-switching Automatic Speech Recognition (ASR) systems. By adopting romanization encoding alongside a balanced concatenated tokenizer within a FastConformer-RNNT framework equipped with a Roman2Char module, we significantly reduce vocabulary and output dimensions, enabling larger training batches and reduced memory consumption. Our method decouples acoustic modeling and language modeling, enhancing the flexibility and adaptability of the system. In our study, applying this method to Mandarin-English ASR resulted in a remarkable 63.51% vocabulary reduction and notable performance gains of 13.72% and 15.03% on SEAME code-switching benchmarks. Ablation studies on Mandarin-Korean and Mandarin-Japanese highlight our method's strong capability to address the complexities of other script-heavy languages, paving the way for more versatile and effective multilingual ASR systems.
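To make the vocabulary argument concrete, the following is a minimal sketch of why romanization shrinks the output space and how a concatenated tokenizer can share one ID space across languages. It is not the paper's implementation: the character-to-Pinyin table, the English sub-word list, and the helper names `build_vocab` and `concat_encode` are all hypothetical, and the balancing step of the tokenizer is not modeled.

```python
# Toy character -> toned-Pinyin table (hypothetical; a real system would use a full lexicon).
CHAR_TO_PINYIN = {
    "他": "ta1", "她": "ta1", "它": "ta1",
    "是": "shi4", "事": "shi4", "市": "shi4",
    "在": "zai4", "再": "zai4",
}


def build_vocab(tokens):
    """Assign consecutive IDs to the sorted set of unique tokens."""
    return {tok: i for i, tok in enumerate(sorted(set(tokens)))}


# Character-level vocabulary versus romanized vocabulary for the same characters.
char_vocab = build_vocab(CHAR_TO_PINYIN.keys())
roman_vocab = build_vocab(CHAR_TO_PINYIN.values())

# Stand-in English sub-word vocabulary (made-up tokens, not a trained BPE model).
english_vocab = build_vocab(["▁the", "▁meet", "ing", "▁is"])


def concat_encode(tagged_tokens):
    """Encode (lang, token) pairs into one ID space.

    English IDs come first; Pinyin IDs are shifted by the English vocabulary size.
    """
    offset = len(english_vocab)
    ids = []
    for lang, tok in tagged_tokens:
        if lang == "en":
            ids.append(english_vocab[tok])
        else:
            ids.append(offset + roman_vocab[CHAR_TO_PINYIN[tok]])
    return ids


# Fraction of Mandarin output units saved by predicting Pinyin instead of characters.
reduction = 1 - len(roman_vocab) / len(char_vocab)

# A code-switched utterance: two Mandarin characters followed by two English sub-words.
ids = concat_encode([("zh", "他"), ("zh", "在"), ("en", "▁meet"), ("en", "ing")])
```

In this toy table eight characters collapse onto three Pinyin units, so the ASR output layer needs fewer classes. The price is that the mapping back to characters is one-to-many, which is the job the Roman2Char module takes on.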

Paper Structure (12 sections, 1 figure, 6 tables)

Introduction
Related work
Method
Model structure
Roman and BPE concatenated tokenizer
Roman to Character Decoder
Experiments
Data
Experiment Setup
Results
Ablation Study
Conclusion

Figures (1)

Figure 1: The proposed approach builds upon the baseline FastConformer-RNNT model, which incorporates a Concatenated Tokenizer and is outlined within a dashed rectangle. Instead of using direct Char input/output for Mandarin and BPE for English, our approach applies romanization encoding, feeding Pinyin (for Mandarin) and BPE (for English) into the FastConformer-RNNT. The Roman to Char Decoder then maps these romanized tokens back to Char and BPE, respectively. The model is trained end-to-end (E2E), combining text-to-text loss with RNNT loss for optimization.
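
The end-to-end objective described in the caption can be sketched as a sum of two terms. The version below is a minimal illustration under stated assumptions, not the authors' training code: the RNNT loss is taken as a precomputed number, the text-to-text loss is modeled as plain per-token cross-entropy, and the weighting factor `r2c_weight` is a hypothetical hyperparameter that the caption does not specify.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits), computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def r2c_loss(step_logits, targets):
    """Mean per-token cross-entropy of a Roman-to-Char decoder over one sequence."""
    losses = [cross_entropy(l, t) for l, t in zip(step_logits, targets)]
    return sum(losses) / len(targets)


def joint_loss(rnnt_loss, step_logits, targets, r2c_weight=1.0):
    """Weighted sum of a precomputed RNNT loss and the text-to-text loss.

    `r2c_weight` is an assumed hyperparameter, not a value from the paper.
    """
    return rnnt_loss + r2c_weight * r2c_loss(step_logits, targets)


# Two decoding steps over a 3-character output vocabulary (made-up numbers).
step_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
targets = [0, 2]
total = joint_loss(rnnt_loss=1.5, step_logits=step_logits, targets=targets, r2c_weight=0.5)
```

Because both terms are summed into one scalar, gradients from the character-level targets and from the acoustic targets are computed in the same backward pass, which is one way to read "trained end-to-end" in the caption.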
