LAMA-UT: Language Agnostic Multilingual ASR through Orthography Unification and Language-Specific Transliteration

Sangmin Lee, Woo-Jin Chung, Hong-Goo Kang

TL;DR

LAMA-UT tackles universal multilingual ASR by separating the problem into orthography unification and language-specific transliteration, enabling broad language coverage including unseen languages. The two-phase pipeline uses Romanization to create a universal transcription, then leverages frozen LLMs as universal converters to produce language-specific outputs, achieving competitive results with only about 680 hours of training data. Across 102 seen languages and 25 unseen languages, LAMA-UT demonstrates substantial improvements over Whisper and parity with MMS while avoiding language-specific adapters or lexicons. This approach offers a practical, data-efficient path toward flexible, scalable multilingual ASR, with potential for enhancement via improved prompts and larger LLMs.
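The two-phase flow described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `build_converter_prompt`, `transcribe`, the prompt wording, and the two callables passed in (`transcription_generator`, `llm`) are hypothetical names introduced here, and the real system's prompt format is not given in this text.

```python
from typing import Any, Callable


def build_converter_prompt(romanized: str, target_language: str) -> str:
    """Build an illustrative instruction prompt for a frozen LLM converter.

    The wording is an assumption made for this sketch, not the paper's prompt.
    """
    return (
        f"Convert the following Romanized transcription into {target_language} "
        "written in its native orthography.\n"
        f"Romanized: {romanized}\n"
        f"{target_language}:"
    )


def transcribe(
    audio: Any,
    target_language: str,
    transcription_generator: Callable[[Any], str],
    llm: Callable[[str], str],
) -> str:
    """Run the two phases in sequence.

    Phase 1: a language-agnostic model maps audio to Romanized text.
    Phase 2: a frozen LLM maps the Romanized text to the target orthography.
    """
    romanized = transcription_generator(audio)
    prompt = build_converter_prompt(romanized, target_language)
    return llm(prompt).strip()
```

Passing both models in as callables keeps the sketch independent of any particular speech model or LLM API; only the second phase sees the target language, which mirrors the separation the TL;DR describes.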

Abstract

Building a universal multilingual automatic speech recognition (ASR) model that performs equitably across languages has long been a challenge due to its inherent difficulties. To address this task, we introduce a Language-Agnostic Multilingual ASR pipeline through orthography Unification and language-specific Transliteration (LAMA-UT). LAMA-UT operates without any language-specific modules and, despite being trained on a minimal amount of data, matches the performance of state-of-the-art models. Our pipeline consists of two key steps. First, we utilize a universal transcription generator to unify orthographic features into Romanized form and capture common phonetic characteristics across diverse languages. Second, we utilize a universal converter to transform these universal transcriptions into language-specific ones. In experiments, we demonstrate the effectiveness of our proposed method leveraging universal transcriptions for massively multilingual ASR. Our pipeline achieves a relative error reduction rate of 45% when compared to Whisper and performs comparably to MMS, despite being trained on only 0.1% of Whisper's training data. Furthermore, although our pipeline does not rely on any language-specific modules, it performs on par with zero-shot ASR approaches that utilize additional language-specific lexicons and language models. We expect this framework to serve as a cornerstone for flexible multilingual ASR systems that are generalizable even to unseen languages.
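To make the two headline figures concrete, the sketch below shows how a relative error reduction is usually computed and checks the data-budget arithmetic. The error rates used in the example are illustrative placeholders, not results from the paper. The 680,000-hour figure for Whisper's training set comes from Whisper's own report and is an assumption not stated in this text; it is consistent with the roughly 680 hours quoted in the TL;DR.

```python
def relative_error_reduction(baseline_error: float, system_error: float) -> float:
    """Return (baseline - system) / baseline, i.e. the fraction of errors removed."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - system_error) / baseline_error


# Illustrative placeholder error rates (NOT taken from the paper):
# a baseline at 20.0% and a system at 11.0% give a reduction of about 0.45.
example_reduction = relative_error_reduction(20.0, 11.0)

# Data budget: about 680 hours against Whisper's reported 680,000 hours
# (assumption, see lead-in) is a fraction of about 0.001, i.e. 0.1%.
LAMA_UT_TRAINING_HOURS = 680
WHISPER_TRAINING_HOURS = 680_000
training_fraction = LAMA_UT_TRAINING_HOURS / WHISPER_TRAINING_HOURS
```

Note that a relative reduction is measured against the baseline's error, so a 45% relative reduction does not mean the absolute error rate dropped by 45 points.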

Paper Structure

This paper contains 41 sections, 3 figures, and 11 tables.

Figures (3)

  • Figure 1: Illustration of our universal ASR pipeline.
  • Figure 2: Problems derived in single-token IPA recognition. Diacritic ':' indicates phoneme length, which has no explicit phonetic value. $\epsilon$ denotes a blank token in CTC.
  • Figure 3: CER comparison between LAMA-UT and Whisper.
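The Figure 2 caption refers to the CTC blank token $\epsilon$. For readers unfamiliar with it, here is a minimal sketch of the standard greedy CTC collapse rule (merge consecutive repeats, then drop blanks). The toy frame sequence is invented for illustration and is not taken from the figure; `ctc_collapse` and `BLANK` are names introduced here.

```python
from itertools import groupby
from typing import List, Sequence

BLANK = "ε"  # stand-in symbol for the CTC blank token


def ctc_collapse(frames: Sequence[str], blank: str = BLANK) -> List[str]:
    """Collapse a per-frame token sequence with the standard CTC rule.

    Step 1: merge runs of identical consecutive tokens into one token.
    Step 2: remove every blank token.
    """
    merged = [token for token, _ in groupby(frames)]
    return [token for token in merged if token != blank]


# Toy example: the blank between the two "a" runs keeps them as separate
# symbols, and the length mark ":" is emitted as its own token.
toy_frames = ["a", "a", BLANK, "a", ":", ":", BLANK]
toy_output = ctc_collapse(toy_frames)
```

In this toy case the length mark survives as a standalone output token, which shows how a diacritic with no phonetic value of its own can still occupy a token slot when every symbol is recognized separately.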