Romanization Encoding For Multilingual ASR

Wen Ding, Fei Jia, Hainan Xu, Yu Xi, Junjie Lai, Boris Ginsburg

TL;DR

The paper tackles vocabulary and efficiency challenges in multilingual and code-switching (CS) ASR for script-heavy languages by introducing romanization encoding. It combines a balanced concatenated tokenizer with a Roman2Char (R2C) decoder in a FastConformer-RNNT framework. Romanization shrinks the vocabulary and output dimensions, which enables larger training batches and lower memory usage, and the overall design decouples acoustic modeling from language modeling. On the Mandarin-English SEAME code-switching benchmarks, the method reduces the vocabulary by 63.51% and improves MER by 13.72% and 15.03%, with additional gains when leveraging monolingual data; ablations on Mandarin-Korean and Mandarin-Japanese indicate that the approach transfers to other script-heavy languages. The work also highlights practical benefits such as faster RNNT inference, and it points to future enhancements through BPE on romanized text and advanced R2C or LLM-based decoding.

Abstract

We introduce romanization encoding for script-heavy languages to optimize multilingual and code-switching Automatic Speech Recognition (ASR) systems. By adopting romanization encoding alongside a balanced concatenated tokenizer within a FastConformer-RNNT framework equipped with a Roman2Char module, we significantly reduce vocabulary and output dimensions, enabling larger training batches and reduced memory consumption. Our method decouples acoustic modeling and language modeling, enhancing the flexibility and adaptability of the system. In our study, applying this method to Mandarin-English ASR resulted in a remarkable 63.51% vocabulary reduction and notable performance gains of 13.72% and 15.03% on SEAME code-switching benchmarks. Ablation studies on Mandarin-Korean and Mandarin-Japanese highlight our method's strong capability to address the complexities of other script-heavy languages, paving the way for more versatile and effective multilingual ASR systems.
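
To make the vocabulary argument concrete, the following is a minimal sketch of why romanization shrinks the output vocabulary: many Mandarin characters share one Pinyin syllable, so the acoustic model only has to predict the smaller syllable set. The lexicon `CHAR_TO_PINYIN`, the tone-numbered Pinyin format, and both helper functions are assumptions made for illustration; they are not taken from the paper, and the toy reduction figure is unrelated to the reported 63.51%.

```python
# Toy, hand-written character-to-Pinyin lexicon (hypothetical; a real system
# would need a full pronunciation dictionary and handling of polyphonic characters).
CHAR_TO_PINYIN = {
    "他": "ta1",
    "她": "ta1",
    "它": "ta1",
    "是": "shi4",
    "事": "shi4",
    "市": "shi4",
    "试": "shi4",
    "我": "wo3",
    "你": "ni3",
    "好": "hao3",
}


def romanize(text: str) -> list[str]:
    """Map each Mandarin character to its tone-numbered Pinyin syllable."""
    return [CHAR_TO_PINYIN[ch] for ch in text]


def vocab_reduction(lexicon: dict[str, str]) -> float:
    """Fraction by which the output vocabulary shrinks after romanization."""
    n_chars = len(lexicon)                    # character-level vocabulary size
    n_syllables = len(set(lexicon.values()))  # romanized vocabulary size
    return 1.0 - n_syllables / n_chars


print(romanize("我是她"))                # ['wo3', 'shi4', 'ta1']
print(vocab_reduction(CHAR_TO_PINYIN))  # 0.5 (10 characters, 5 syllables)
```

The mapping is many-to-one, so it cannot be inverted by lookup alone. That is the gap the Roman2Char module is described as filling: it has to recover characters from romanized tokens using context.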

Paper Structure (12 sections, 1 figure, 6 tables)

This paper contains 12 sections, 1 figure, 6 tables.

Figures (1)

  • Figure 1: The proposed approach builds upon the baseline FastConformer-RNNT model, which incorporates a Concatenated Tokenizer and is outlined within a dashed rectangle. Instead of using direct Char input/output for Mandarin and BPE for English, our approach applies romanization encoding, feeding Pinyin (for Mandarin) and BPE (for English) into the FastConformer-RNNT. The Roman to Char (Roman2Char) Decoder then maps these tokens back to Char and BPE, respectively. The model is trained end-to-end (E2E), combining text-to-text loss with RNNT loss for optimization.
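
The data flow in Figure 1 can be sketched roughly as follows. This is an illustrative toy, not the paper's implementation: `ConcatTokenizer` assumes that a concatenated tokenizer gives each language's sub-vocabulary its own disjoint ID range via an offset, and `toy_r2c` stands in for the learned Roman2Char decoder with a one-to-one lookup table. All class names, vocabularies, and the `TOY_R2C` table are hypothetical, and the joint RNNT plus text-to-text training is not reproduced here.

```python
class ConcatTokenizer:
    """Toy concatenated tokenizer: each language owns a disjoint ID range."""

    def __init__(self, vocabs: dict[str, list[str]]):
        self.offsets: dict[str, int] = {}
        self.id_to_token: list[tuple[str, str]] = []
        self.token_to_id: dict[tuple[str, str], int] = {}
        for lang, vocab in vocabs.items():
            # The offset is the number of IDs already taken by earlier languages.
            self.offsets[lang] = len(self.id_to_token)
            for tok in vocab:
                self.token_to_id[(lang, tok)] = len(self.id_to_token)
                self.id_to_token.append((lang, tok))

    def encode(self, segments: list[tuple[str, list[str]]]) -> list[int]:
        """Encode language-tagged token segments into one shared ID sequence."""
        return [self.token_to_id[(lang, t)] for lang, toks in segments for t in toks]

    def decode(self, ids: list[int]) -> list[tuple[str, str]]:
        """Recover (language, token) pairs; the ID range implies the language."""
        return [self.id_to_token[i] for i in ids]


# Hypothetical one-to-one table; the real R2C module is a trained decoder that
# must disambiguate homophones from context.
TOY_R2C = {"wo3": "我", "xiang3": "想", "mai3": "买"}


def toy_r2c(tagged_tokens: list[tuple[str, str]]) -> str:
    """Map Pinyin tokens back to characters and detokenize English BPE pieces."""
    pieces = []
    for lang, tok in tagged_tokens:
        if lang == "zh":
            pieces.append(TOY_R2C[tok])
        else:
            # SentencePiece-style word-boundary marker becomes a space.
            pieces.append(tok.replace("\u2581", " "))
    return "".join(pieces)


tok = ConcatTokenizer({
    "zh": ["wo3", "xiang3", "mai3"],     # romanized Mandarin units
    "en": ["\u2581a", "\u2581coffee"],   # English BPE pieces
})
ids = tok.encode([("zh", ["wo3", "xiang3", "mai3"]), ("en", ["\u2581coffee"])])
print(ids)                        # [0, 1, 2, 4]
print(toy_r2c(tok.decode(ids)))   # 我想买 coffee
```

Because the language is recoverable from the ID range alone, the romanized Mandarin tokens can be routed to the character-recovery step while English pieces pass through unchanged.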