Transliterated Zero-Shot Domain Adaptation for Automatic Speech Recognition

Han Zhu, Gaofeng Cheng, Qingwei Zhao, Pengyuan Zhang

TL;DR

This work tackles zero-shot domain adaptation (ZSDA) for ASR by transferring target-domain knowledge across languages using cross-lingual pre-training (XLPT) followed by target-language fine-tuning. To preserve cross-language knowledge during fine-tuning, it introduces transliterated ZSDA, in which transliteration-based pre-training labels keep the pre-training and fine-tuning labels consistent, reducing forgetting. The approach combines transliterated XLPT, curriculum XLPT to improve transliteration quality, and continuous pseudo-labeling within a shared-hidden-layer architecture that employs language-specific classifiers. Empirical results show that transliterated ZSDA achieves a 9.2% relative WER reduction over a wav2vec 2.0 baseline, outperforms self-supervised ZSDA, and matches supervised ZSDA, demonstrating strong cross-language domain transfer without source-language transcriptions. The method thus extends domain adaptation to settings where no target-domain data is available in the target language and the source-language speech is untranscribed.
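The "shared-hidden-layer architecture that employs language-specific classifiers" mentioned above can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the authors' implementation: the class name `SharedHiddenLayerModel`, the single feed-forward layer standing in for the shared encoder, and all dimensions and vocabulary sizes are hypothetical.

```python
import random


def linear(weights, x):
    """Multiply a weight matrix (a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu(x):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in x]


def random_matrix(rows, cols, rng):
    """Small random weights; a stand-in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class SharedHiddenLayerModel:
    """Hypothetical sketch: one shared hidden layer, one classifier per language."""

    def __init__(self, feat_dim, hidden_dim, vocab_sizes, seed=0):
        rng = random.Random(seed)
        # Shared across all languages (assumption: one layer, for brevity).
        self.shared = random_matrix(hidden_dim, feat_dim, rng)
        # One output layer per language, sized to that language's vocabulary.
        self.classifiers = {
            lang: random_matrix(size, hidden_dim, rng)
            for lang, size in vocab_sizes.items()
        }

    def forward(self, frame, lang):
        """Return per-token logits for one acoustic frame in the given language."""
        hidden = relu(linear(self.shared, frame))
        return linear(self.classifiers[lang], hidden)


# Hypothetical sizes; the language pair follows the Figure 2 example.
model = SharedHiddenLayerModel(
    feat_dim=4, hidden_dim=8, vocab_sizes={"mandarin": 5, "cantonese": 6}
)
frame = [0.1, -0.2, 0.3, 0.4]
logits_mandarin = model.forward(frame, "mandarin")
logits_cantonese = model.forward(frame, "cantonese")
print(len(logits_mandarin), len(logits_cantonese))
```

Both languages pass through the same hidden weights, so whatever the shared layer learns from one language is available to the other; only the final classifier differs.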

Abstract

The performance of automatic speech recognition models often degrades on domains not covered by the training data. Domain adaptation can address this issue, assuming that target-domain data is available in the target language. However, this assumption does not hold in many real-world applications. To make domain adaptation more widely applicable, we address the problem of zero-shot domain adaptation (ZSDA), where target-domain data is unavailable in the target language. Instead, we transfer the target-domain knowledge from another source language in which target-domain data is more accessible. To do so, we first perform cross-lingual pre-training (XLPT) to share domain knowledge across languages, then use target-language fine-tuning to build the final model. One challenge in this practice is that the pre-trained knowledge can be forgotten during fine-tuning, resulting in sub-optimal adaptation performance. To address this issue, we propose transliterated ZSDA, which makes the pre-training and fine-tuning labels consistent and thereby preserves as much of the pre-trained knowledge as possible. Experimental results show that transliterated ZSDA reduces the word error rate by a relative 9.2% compared with a wav2vec 2.0 baseline. Moreover, transliterated ZSDA consistently outperforms self-supervised ZSDA and performs on par with supervised ZSDA, demonstrating the advantage of transliteration-based pre-training labels.
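The 9.2% figure above is a relative reduction, not an absolute one. A small sketch of the arithmetic follows; the function name `relative_wer_reduction` and the two WER values are hypothetical, chosen only so that the result matches the reported relative reduction, and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer, adapted_wer):
    """Relative WER reduction in percent: 100 * (baseline - adapted) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - adapted_wer) / baseline_wer


# Hypothetical WERs (in percent), for illustration only.
baseline = 25.0
adapted = 22.7
reduction = relative_wer_reduction(baseline, adapted)
print(f"absolute: {baseline - adapted:.1f} points, relative: {reduction:.1f}%")
```

With these illustrative numbers, an absolute drop of 2.3 WER points corresponds to a 9.2% relative reduction, which is why the two kinds of figure should not be compared directly.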

Paper Structure

This paper contains 23 sections, 15 equations, 7 figures, 9 tables, 1 algorithm.

Figures (7)

  • Figure 1: The diagram of the proposed ZSDA framework: cross-lingual pre-training (XLPT) and target language fine-tuning. The proposed transliterated ZSDA method is a special case of this framework, where we use transliterated XLPT.
  • Figure 2: Illustration of transliterated XLPT. In this example, the target language is Cantonese, whereas the source language is Mandarin. To illustrate how the transliteration reflects the content of the source language speech, we provide the original speech of the source language and the speech synthesized with the transliteration at https://zhu-han.github.io/transliteration.
  • Figure 3: Computation procedure of BT-CTC loss. BT-CTC loss can measure the similarity between transliteration and source language speech.
  • Figure 4: Illustration of representation similarity between the pre-trained and fine-tuned models with transliterated or self-supervised XLPT methods. Higher similarity is better.
  • Figure 5: Comparison of representations before and after fine-tuning for self-supervised and transliterated XLPT models. The same color denotes representations aligned to the same output token. The dot symbol $\cdot$ and the plus symbol $+$ denote representations of the target and source domains, respectively. Best viewed in color and zoomed in.
  • ...and 2 more figures