Enhancing Multilingual ASR for Unseen Languages via Language Embedding Modeling

Shao-Syuan Huang, Kuan-Po Huang, Andy T. Liu, Hung-yi Lee

TL;DR

The paper tackles the challenge of unseen languages in multilingual ASR by exploiting linguistic relationships in two steps: a probability-weighted sum of existing language tag embeddings, and a predictor-based refinement that moves this sum closer to the true embedding. It introduces a weighted-sum embedding (and a trainable variant) and an MLP predictor that maps these embeddings toward ground-truth representations, applicable during both fine-tuning and inference. Empirical results show substantial CER and WER improvements in zero-shot and fine-tuning settings, with predictor-based approaches, particularly the corpus-wise variant, delivering the strongest gains. The approach is resource-efficient compared to large LLM-guided methods and demonstrates practical improvements for extending Whisper to unseen languages.

Abstract

Multilingual Automatic Speech Recognition (ASR) aims to recognize and transcribe speech from multiple languages within a single system. Whisper, one of the most advanced ASR models, excels in this domain by handling 99 languages effectively, leveraging a vast amount of data and incorporating language tags as prefixes to guide the recognition process. However, despite its success, Whisper struggles with unseen languages, those not included in its pre-training. Motivated by the observation that many languages share linguistic characteristics, we propose methods that exploit these relationships to enhance ASR performance on unseen languages. Specifically, we introduce a weighted sum method, which computes a weighted sum of the embeddings of language tags, using Whisper's predicted language probabilities. In addition, we develop a predictor-based approach that refines the weighted sum embedding to more closely approximate the true embedding for unseen languages. Experimental results demonstrate substantial improvements in ASR performance, both in zero-shot and fine-tuning settings. Our proposed methods outperform baseline approaches, providing an effective solution for addressing unseen languages in multilingual ASR.
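
The weighted sum method described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `weighted_sum_embedding`, the three-dimensional toy embeddings, and the probability values are all hypothetical, chosen only to show how predicted language probabilities could mix the embeddings of known language tags into one vector for an unseen language.

```python
def weighted_sum_embedding(lang_probs, tag_embeddings):
    """Mix language tag embeddings using predicted language probabilities.

    lang_probs:     dict mapping a language tag to its predicted probability
    tag_embeddings: dict mapping a language tag to its embedding (list of floats)
    Both structures are illustrative assumptions, not Whisper's internal API.
    """
    dim = len(next(iter(tag_embeddings.values())))
    mixed = [0.0] * dim
    for tag, prob in lang_probs.items():
        embedding = tag_embeddings[tag]
        for i in range(dim):
            # Each known language contributes in proportion to its probability.
            mixed[i] += prob * embedding[i]
    return mixed


# Toy data (hypothetical): one-hot embeddings make the mixing easy to inspect.
tag_embeddings = {
    "<|es|>": [1.0, 0.0, 0.0],
    "<|pt|>": [0.0, 1.0, 0.0],
    "<|it|>": [0.0, 0.0, 1.0],
}

# Hypothetical language-identification output for an utterance in an unseen language.
lang_probs = {"<|es|>": 0.6, "<|pt|>": 0.3, "<|it|>": 0.1}

mixed = weighted_sum_embedding(lang_probs, tag_embeddings)
print(mixed)
```

In this toy setup the mixed vector leans toward the embedding of the most probable known language, which is the intuition behind using related languages as a stand-in for a language tag the model never saw.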

Paper Structure

This paper contains 16 sections, 3 equations, 1 figure, 1 table.

Figures (1)

  • Figure 1: Diagram showing the process of obtaining the weighted sum language embedding, with the yellow boxes indicating the embeddings of individual language tokens.