LUPET: Incorporating Hierarchical Information Path into Multilingual ASR

Wei Liu; Jingyong Hou; Dong Yang; Muyong Cao; Tan Lee

TL;DR

LUPET introduces a hierarchical information path that injects, in sequence, LID prediction, acoustic unit discovery, IPA phoneme sharing, and MoE-routed token recognition into a single multilingual ASR model. The path unfolds across the encoder layers, from shallow to deep, and the model is optimized with a joint objective. On 10 Common Voice languages, the approach achieves relative WER reductions over vanilla multilingual setups and baselines, particularly on high-resource languages, and it mitigates the compromise in high-resource performance that arises when high-resource and low-resource languages are trained together. Experiments and ablations examine the contribution of each component and show that attention-based decoding improves performance further. The result is a single framework for integrating diverse linguistic signals in multilingual ASR.

Abstract

Toward high-performance multilingual automatic speech recognition (ASR), various types of linguistic information and model designs have demonstrated their effectiveness independently. They include language identity (LID), phoneme information, language-specific processing modules, and cross-lingual self-supervised speech representations. Leveraging their benefits synergistically in a unified solution is expected to further improve overall system performance. This paper presents a novel design of a hierarchical information path, named LUPET, which sequentially encodes, from the shallow layers to the deep layers, multiple aspects of linguistic and acoustic information at diverse granularity scales. The path starts with LID prediction, followed by acoustic unit discovery, phoneme sharing, and finally token recognition routed by a mixture of experts (MoE). ASR experiments are carried out on 10 languages in the Common Voice corpus. The results demonstrate the superior performance of LUPET compared to the baseline systems. Most importantly, LUPET effectively mitigates the compromise in performance on high-resource languages that arises when they are trained together with low-resource ones in the multilingual setting.

Paper Structure (16 sections, 10 equations, 2 figures, 3 tables)

Introduction
LUPET
Vanilla E2E Multilingual ASR
Incorporating LUPET
Experimental Setup
Dataset
Multilingual ASR Configurations
Vanilla
LUPET
Baselines
Training Scheme and Evaluation Metric
Results and Analysis
Performance Comparison to Monolingual System
LUPET's Effectiveness Verification
Results on Attention Decoding
...and 1 more section

Figures (2)

Figure 1: The overall architecture of our proposed LUPET multilingual ASR. The LUPET information path unfolds with the encoder layers. $\{Enc^{s}, Enc^{lm}, Enc^{um}, Enc^{d}\}$ represent the shallow, lower-middle, upper-middle, and deep layers, respectively. $Enc^{s}$ and $Enc^{um}$ are used for LID and IPA phoneme prediction, respectively. $Enc^{lm}$ performs acoustic unit discovery with a random-projection quantizer, where $\mathbb{C}$ denotes the codebook for vector quantization (VQ). $Enc^{d}$ denotes Conformer layers modified with MoE, which consists of four experts and a router. All trapezoid modules are linear projections.
Figure 2: Relative WER changes of different systems with respect to the monolingual systems on 10 languages, using CTC greedy decoding.
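
The Figure 1 caption mentions acoustic unit discovery with a random-projection quantizer and a VQ codebook $\mathbb{C}$. As a rough illustration of that general technique, the sketch below projects each feature frame with a frozen random matrix and returns the index of the nearest entry in a frozen random codebook. It is a minimal sketch under stated assumptions, not the authors' implementation: the class name `RandomProjectionQuantizer`, the dimensions, the codebook size, and the use of L2 normalization are all illustrative choices, and this page does not specify how LUPET uses the resulting unit labels during training.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


class RandomProjectionQuantizer:
    """Hypothetical helper: frozen random projection plus frozen random codebook."""

    def __init__(self, input_dim, code_dim, codebook_size, seed=0):
        rng = random.Random(seed)
        # Projection matrix with code_dim rows and input_dim columns (never trained).
        self.projection = [
            [rng.gauss(0.0, 1.0) for _ in range(input_dim)]
            for _ in range(code_dim)
        ]
        # Codebook C: each entry is a unit-length random vector (never trained).
        self.codebook = [
            l2_normalize([rng.gauss(0.0, 1.0) for _ in range(code_dim)])
            for _ in range(codebook_size)
        ]

    def quantize(self, frame):
        """Return the index of the codebook entry nearest to the projected frame."""
        projected = l2_normalize(
            [sum(w * x for w, x in zip(row, frame)) for row in self.projection]
        )
        # For unit vectors, the nearest neighbour is the one with the largest dot product.
        scores = [
            sum(c * p for c, p in zip(code, projected)) for code in self.codebook
        ]
        return max(range(len(scores)), key=scores.__getitem__)


# Illustrative sizes only (assumed, not taken from the paper).
quantizer = RandomProjectionQuantizer(input_dim=8, code_dim=4, codebook_size=16)
frame_rng = random.Random(1)
frames = [[frame_rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(5)]
unit_labels = [quantizer.quantize(frame) for frame in frames]
```

Each entry of `unit_labels` is a discrete acoustic-unit index in the range `[0, codebook_size)`. Because neither the projection nor the codebook is updated, the quantizer adds no trainable parameters in this sketch.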
