How to Learn a New Language? An Efficient Solution for Self-Supervised Learning Models Unseen Languages Adaption in Low-Resource Scenario

Shih-Heng Wang; Zih-Ching Chen; Jiatong Shi; Ming-To Chuang; Guan-Ting Lin; Kuan-Po Huang; David Harwath; Shang-Wen Li; Hung-yi Lee

TL;DR

This paper tackles the domain mismatch that arises in low-resource ASR when speech SSL models pre-trained on high-resource languages are applied to languages they have not seen. It introduces Intermediate Adaptation (IA), a warm-up stage that precedes parameter-efficient fine-tuning (PEFT) with adapters; the source languages used for IA are chosen with a linguistic-tree-based similarity measure to improve transfer to unseen targets. IA can be instantiated with multi-task learning (MTL) or model-agnostic meta-learning (MAML) and yields a better initialization for the adapters and the downstream model, while the SSL backbone stays frozen and only 1-5% of the total parameters are tuned. On the ML-SUPERB benchmark, IA-based adaptation achieves up to a 28% relative improvement in character/phoneme error rate (CER/PER) over conventional PEFT and matches or surpasses full fine-tuning with far fewer tunable parameters.
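To give a feel for what a 1-5% tunable-parameter budget means, the sketch below counts the parameters of adapters inserted into a frozen backbone. It assumes a standard bottleneck adapter (a down-projection followed by an up-projection, each with a bias), which the summary does not specify; the function names and every number in the usage lines are hypothetical and are not taken from the paper.

```python
def bottleneck_adapter_params(hidden_dim: int, bottleneck_dim: int) -> int:
    """Weights and biases of one down-projection plus one up-projection.

    Assumption: a plain bottleneck adapter; the paper's adapter may differ.
    """
    down = hidden_dim * bottleneck_dim + bottleneck_dim
    up = bottleneck_dim * hidden_dim + hidden_dim
    return down + up


def trainable_fraction(
    backbone_params: int,
    num_layers: int,
    hidden_dim: int,
    bottleneck_dim: int,
    downstream_params: int = 0,
) -> float:
    """Share of all parameters that are updated when the backbone is frozen."""
    adapters = num_layers * bottleneck_adapter_params(hidden_dim, bottleneck_dim)
    trainable = adapters + downstream_params
    return trainable / (backbone_params + trainable)


# Hypothetical configuration, chosen only to illustrate the arithmetic.
fraction = trainable_fraction(
    backbone_params=95_000_000,
    num_layers=12,
    hidden_dim=768,
    bottleneck_dim=64,
)
print(f"trainable share: {fraction:.2%}")
```

Under these assumed sizes, the adapters account for a small single-digit percentage of the model, which is the regime the summary describes.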

Abstract

Speech Self-Supervised Learning (SSL) models achieve impressive performance on Automatic Speech Recognition (ASR). However, in low-resource language ASR, they suffer from a domain mismatch between the pre-training languages and the low-resource target languages. Typical solutions fall short: fine-tuning the SSL model incurs a high computation cost, while using a frozen SSL model as a feature extractor performs poorly. To address these issues, we extend a conventional adapter-based efficient fine-tuning scheme with an extra intermediate adaptation stage that warms up the initialization of the adapter and the downstream model. Remarkably, we update only 1-5% of the total model parameters to achieve the adaptation. Experimental results on the ML-SUPERB dataset show that our solution outperforms conventional efficient fine-tuning, achieving up to a 28% relative improvement in character/phoneme error rate when adapting to unseen languages.
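The headline figure is a relative improvement, not an absolute one. As a minimal sketch of that arithmetic, the snippet below computes the relative reduction in error rate between a baseline and an adapted system. The function name and the two error rates are made-up illustrations, not results reported in the paper.

```python
def relative_improvement(baseline_error: float, new_error: float) -> float:
    """Fraction by which new_error improves on baseline_error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Hypothetical error rates (in percent), used only to show the calculation.
baseline_cer = 25.0
adapted_cer = 18.0
print(f"{relative_improvement(baseline_cer, adapted_cer):.0%}")
```

With these illustrative values, an absolute drop of 7 points from a baseline of 25 corresponds to a 28% relative improvement.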
