Multilingual Zero Resource Speech Recognition Base on Self-Supervise Pre-Trained Acoustic Models

Haoyu Wang, Wei-Qiang Zhang, Hongbin Suo, Yulong Wan

TL;DR

The paper tackles zero-resource multilingual ASR for languages lacking labeled data by leveraging self-supervised pre-trained acoustic models to map hidden representations to IPA phonemes and decode with a language model. It introduces a pipeline that fine-tunes XLSR-53, HuBERT-large, or Data2vec-large on phoneme transcriptions, applies vowel splitting for robust cross-language transfer, and extends lexicons with Sequitur G2P for improved decoding. Key findings include a 13.1% WER on Interlingua and an average WER of 33.77% across 8 languages. The method is competitive with baselines trained on as little as 10 hours of labeled data, and it compares more favorably still when labeled audio is even scarcer. The approach offers a practical path to word-level zero-resource recognition using modest target-language text and limited audio resources, and it is evaluated on languages from diverse families.

Abstract

Labeled audio data is insufficient to build satisfactory speech recognition systems for most of the world's languages. Some zero-resource methods attempt phoneme- or word-level speech recognition without labeled audio data of the target language, but their error rates are usually too high for real-world use. Recently, the representation ability of self-supervised pre-trained models has been found to be extremely beneficial in zero-resource phoneme recognition. To the best of our knowledge, this paper is the first attempt to extend the use of pre-trained models to word-level zero-resource speech recognition. This is done by fine-tuning the pre-trained models on IPA phoneme transcriptions and decoding with a language model trained on extra texts. Experiments on Wav2vec 2.0 and HuBERT models show that this method can achieve a word error rate below 20% on some languages, and the average error rate across 8 languages is 33.77%.
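
The results above are reported as word error rate (WER). As a reminder of what that number measures, here is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative assumptions, not taken from the paper, and the paper's own scoring tool may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substitution ("sat" -> "sit") and one deletion ("the").
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER = {wer:.2%}")
```

Because insertions are counted, WER can exceed 100% when the hypothesis is much longer than the reference.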

Paper Structure (17 sections, 2 figures, 3 tables)

Figures (2)

  • Figure 1: An overview of our method. Notice that the diphthong is split in the model output. Zero-resource recognition may introduce extra mistakes in model output, which can be corrected by the extended lexicon.
  • Figure 2: Comparison of the proposed method and the baseline models on different sizes of training data. Our zero-resource method has competitive performance when training data is less than 10 hours.
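
The Figure 1 caption notes that diphthongs are split in the model output. The sketch below illustrates one way such a split could work on a tokenized IPA sequence. It is only an assumption-laden illustration: the helper `split_diphthongs`, the small vowel inventory `IPA_VOWELS`, and the rule "split any token made up entirely of two or more vowel symbols" are hypothetical and are not the authors' actual procedure.

```python
# Illustrative (incomplete) set of single-character IPA vowel symbols.
# This inventory is an assumption for the example, not the paper's phoneme set.
IPA_VOWELS = set("aeiouyɑɐɒæɛɜəɪɨɔøœʊʉʌɯɤ")


def split_diphthongs(phonemes: list[str]) -> list[str]:
    """Split tokens made only of several vowel symbols into single vowels.

    Tokens containing any non-vowel symbol (consonants, affricates, or
    vowels carrying a length mark such as "iː") are left unchanged.
    """
    result = []
    for token in phonemes:
        is_vowel_cluster = len(token) > 1 and all(ch in IPA_VOWELS for ch in token)
        if is_vowel_cluster:
            result.extend(token)  # one output token per vowel symbol
        else:
            result.append(token)
    return result


# Hypothetical example: the diphthong "aɪ" becomes the two vowels "a" and "ɪ".
print(split_diphthongs(["t", "aɪ", "m"]))
```

Splitting in this way lets a target language reuse single-vowel units seen during fine-tuning even when the exact diphthong never occurred in the training languages, which is the motivation the summary gives for vowel splitting.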