Exploring the Impact of Data Quantity on ASR in Extremely Low-resource Languages

Yao-Fei Cheng, Li-Wei Chen, Hung-Shin Lee, Hsin-Min Wang

TL;DR

The paper tackles the challenge of building ASR for extremely low-resource Austronesian languages Amis and Seediq with minimal labeled data. It introduces a data-selection strategy that uses language embeddings from a spoken language ID model and three one-class classifiers to curate acoustically similar utterances from a multilingual corpus for continued pre-training of a multilingual SSL model (XLSR-300M). Across extensive experiments, larger SSL models generally improve performance but can overfit with scarce data; the proposed ensemble data-selection and increased pre-training data improve robustness and reduce CER. Overall, the study demonstrates the feasibility of cross-lingual transfer learning to boost ASR in severely under-resourced languages and provides new corpora and methods for data augmentation in this domain.

Abstract

This study investigates the efficacy of data augmentation techniques for low-resource automatic speech recognition (ASR), focusing on two endangered Austronesian languages, Amis and Seediq. Recognizing the potential of self-supervised learning (SSL) in low-resource settings, we explore the impact of data volume on the continued pre-training of SSL models. We propose a novel data-selection scheme leveraging a multilingual corpus to augment the limited target language data. This scheme utilizes a language classifier to extract utterance embeddings and employs one-class classifiers to identify utterances phonetically and phonologically proximate to the target languages. Utterances are ranked and selected based on their decision scores, ensuring the inclusion of highly relevant data in the SSL-ASR pipeline. Our experimental results demonstrate the effectiveness of this approach, yielding substantial improvements in ASR performance for both Amis and Seediq. These findings underscore the feasibility and promise of data augmentation through cross-lingual transfer learning for low-resource language ASR.
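The rank-and-select step described in the abstract can be illustrated with a minimal sketch. It makes several assumptions that are not in the source: the `Utterance` record, the function names, and the toy embeddings are hypothetical, and a simple centroid cosine-similarity scorer stands in for the paper's one-class classifiers, which this summary does not name. Only the overall shape (score each candidate against the target-language embeddings, rank by score, keep the best utterances up to an hour budget) follows the text.

```python
import math
from dataclasses import dataclass


@dataclass
class Utterance:
    """Hypothetical record for one candidate utterance from the multilingual corpus."""
    utt_id: str
    duration_s: float        # audio length in seconds
    embedding: list          # language embedding from a spoken language ID model


def centroid(vectors):
    """Mean vector of the target-language embeddings."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_top_k_hours(target_embeddings, candidates, k_hours):
    """Rank candidates by similarity to the target centroid (a stand-in for a
    one-class decision score) and keep the best ones until k_hours is reached."""
    center = centroid(target_embeddings)
    ranked = sorted(
        candidates,
        key=lambda u: cosine(u.embedding, center),
        reverse=True,
    )
    budget_s = k_hours * 3600.0
    selected, total_s = [], 0.0
    for utt in ranked:
        if total_s >= budget_s:
            break
        selected.append(utt)
        total_s += utt.duration_s
    return selected
```

Calling `select_top_k_hours(target_embeddings, candidates, k_hours=1.0)` returns the highest-scoring utterances whose durations sum to roughly one hour. In a fuller pipeline, the cosine scorer would presumably be replaced by the decision scores of trained one-class classifiers, possibly combined across several models.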

Paper Structure

This paper contains 13 sections, 2 figures, 1 table, and 1 algorithm.

Figures (2)

  • Figure 1: Diagram for picking the top-$k$ hours of data.
  • Figure 2: The effect of data amount and selection on continued pre-training of XLSR-128 for Amis and Seediq. The x-axis represents the amount of sampled data (in hours); a value of '1' means that 1 hour of non-target language data is used together with 1 hour of target language data in continued pre-training. The blue and red shades represent the amount of pre-training data.