How to Learn a New Language? An Efficient Solution for Self-Supervised Learning Models Unseen Languages Adaption in Low-Resource Scenario

Shih-Heng Wang, Zih-Ching Chen, Jiatong Shi, Ming-To Chuang, Guan-Ting Lin, Kuan-Po Huang, David Harwath, Shang-Wen Li, Hung-yi Lee

TL;DR

This paper tackles the domain mismatch that arises in low-resource ASR when using speech SSL models pre-trained on high-resource languages. It introduces Intermediate Adaptation (IA) followed by parameter-efficient fine-tuning (PEFT) via adapters, with source languages chosen through a linguistic-tree-based similarity measure to improve transfer to unseen targets. IA can be instantiated with Multi-Task Learning (MTL) or Model-Agnostic Meta-Learning (MAML) and yields a better initialization for the adapters and the downstream model, while keeping the SSL backbone frozen and limiting tunable parameters to 1-5% of the total. On the ML-SUPERB benchmark, IA-based adaptation achieves up to 28% relative CER/PER improvement over PEFT and matches or surpasses full fine-tuning with far fewer trainable parameters, demonstrating practical efficacy for unseen-language adaptation in low-resource scenarios.

Abstract

The utilization of speech Self-Supervised Learning (SSL) models achieves impressive performance on Automatic Speech Recognition (ASR). However, in low-resource language ASR, they encounter the domain mismatch problem between pre-trained and low-resource languages. Typical solutions like fine-tuning the SSL model suffer from high computation costs while using frozen SSL models as feature extractors comes with poor performance. To handle these issues, we extend a conventional efficient fine-tuning scheme based on the adapter. We add an extra intermediate adaptation to warm up the adapter and downstream model initialization. Remarkably, we update only 1-5% of the total model parameters to achieve the adaptation. Experimental results on the ML-SUPERB dataset show that our solution outperforms conventional efficient fine-tuning. It achieves up to a 28% relative improvement in the Character/Phoneme error rate when adapting to unseen languages.
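
For readers unfamiliar with the metric, the "relative improvement" quoted above is conventionally the error-rate reduction divided by the baseline error rate. The snippet below is a minimal sketch of that arithmetic; the function name and the two error rates are hypothetical values chosen for illustration and are not results reported in the paper.

```python
def relative_improvement(baseline_error: float, new_error: float) -> float:
    """Fraction by which new_error is lower than baseline_error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Hypothetical error rates (in %), for illustration only; not taken from the paper.
baseline_cer = 25.0  # e.g. conventional adapter fine-tuning
adapted_cer = 18.0   # e.g. after an intermediate adaptation stage

print(f"{relative_improvement(baseline_cer, adapted_cer):.0%}")  # prints "28%"
```

Note that a 28% relative improvement is not a 28-point absolute drop: in this made-up example the absolute reduction is 7 points.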

Paper Structure

This paper contains 15 sections, 4 equations, 2 figures, 4 tables, 1 algorithm.

Figures (2)

  • Figure 1: Pipeline of our solution. Before fine-tuning the adapter and downstream model (omitted in the figure) on each target language, we warm them up with Intermediate Adaptation.
  • Figure 2: Our source language selection process with the linguistic tree. Based on the topology of the example linguistic tree, we pick "Luxembourgish" and "Ndebele" instead of "Manx Gaelic" as source languages for IA because they are linguistically closer to our target languages "English" and "Swedish".
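
The idea in the Figure 2 caption, preferring source languages that sit close to the targets in a linguistic tree, can be sketched as follows. This is only an illustration under stated assumptions: the page does not give the paper's actual similarity measure, so the sketch uses the number of edges between two leaves as the distance and ranks candidates by mean distance to the targets. The `PARENT` map is a simplified toy tree, not the tree used in the paper, and all function names are hypothetical.

```python
# Toy child -> parent map. Simplified and hypothetical; not the paper's tree.
PARENT = {
    "English": "West Germanic",
    "Luxembourgish": "West Germanic",
    "Swedish": "North Germanic",
    "West Germanic": "Germanic",
    "North Germanic": "Germanic",
    "Manx Gaelic": "Celtic",
    "Germanic": "Indo-European",
    "Celtic": "Indo-European",
}


def path_to_root(node):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def tree_distance(a, b):
    """Number of edges between a and b via their lowest common ancestor."""
    depth_in_b = {n: i for i, n in enumerate(path_to_root(b))}
    for i, n in enumerate(path_to_root(a)):
        if n in depth_in_b:
            return i + depth_in_b[n]
    raise ValueError(f"{a!r} and {b!r} share no ancestor")


def select_sources(candidates, targets, k):
    """Pick the k candidates with the smallest mean tree distance to the targets.

    Assumed ranking rule for illustration; the paper's measure may differ.
    """
    def mean_distance(candidate):
        return sum(tree_distance(candidate, t) for t in targets) / len(targets)

    return sorted(candidates, key=mean_distance)[:k]


print(select_sources(["Manx Gaelic", "Luxembourgish"], ["English", "Swedish"], k=1))
# prints ['Luxembourgish']
```

In this toy tree, Luxembourgish has a mean distance of 3 to the two targets and Manx Gaelic a mean distance of 5, so Luxembourgish is selected, which is consistent with the preference described in the caption.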