Improving noisy student training for low-resource languages in End-to-End ASR using CycleGAN and inter-domain losses

Chia-Yu Li, Ngoc Thang Vu

TL;DR

This work tackles data scarcity in low-resource end-to-end ASR by using CycleGAN and inter-domain losses (CID), trained on abundant external text, to improve the teacher model in a noisy student training (NST) framework. It extends CID with automatic hyperparameter tuning and integrates it into a streamlined cNST pipeline that reduces reliance on large amounts of speech data. Experiments on six non-English languages show a WER reduction of about 20% compared with the baseline teacher model and about 10% compared with the baseline best student model, indicating that CID trained on external text can meaningfully improve semi-supervised ASR. The findings point to a practical approach to building ASR for languages with minimal annotated resources.

Abstract

Training a semi-supervised end-to-end speech recognition system using noisy student training has significantly improved performance. However, this approach requires a substantial amount of paired speech-text and unlabeled speech, which is costly for low-resource languages. Therefore, this paper considers a more extreme case of semi-supervised end-to-end automatic speech recognition where there are limited paired speech-text, unlabeled speech (less than five hours), and abundant external text. Firstly, we observe improved performance by training the model using our previous work on semi-supervised learning "CycleGAN and inter-domain losses" solely with external text. Secondly, we enhance "CycleGAN and inter-domain losses" by incorporating automatic hyperparameter tuning, calling it "enhanced CycleGAN inter-domain losses." Thirdly, we integrate it into the noisy student training approach pipeline for low-resource scenarios. Our experimental results, conducted on six non-English languages from Voxforge and Common Voice, show a 20% word error rate reduction compared to the baseline teacher model and a 10% word error rate reduction compared to the baseline best student model, highlighting the significant improvements achieved through our proposed method.
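
To make the reported metric concrete, the following is a minimal sketch of how a word error rate and a WER reduction between two models could be computed. It assumes the percentages above are relative reductions, which the abstract does not state explicitly. The function names and the example sentences are hypothetical and are not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(references, hypotheses):
    """Corpus-level WER: total word errors divided by total reference words."""
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    words = sum(len(r.split()) for r in references)
    return errors / words


def relative_wer_reduction(baseline_wer, new_wer):
    """Fraction by which new_wer improves on baseline_wer (assumed relative)."""
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical example: one substitution and one deletion in four words.
toy_wer = wer(["a b c d"], ["a x c"])
toy_reduction = relative_wer_reduction(0.5, 0.4)
```

Under this reading, a teacher at 50% WER and a student at 40% WER would correspond to a 20% relative reduction, even though the absolute difference is 10 points.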

Paper Structure (16 sections, 6 equations, 4 figures, 5 tables)

This paper contains 16 sections, 6 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: The framework of CycleGAN and inter-domain losses \cite{CY_cycleGAN-inter-domain-losses}.
  • Figure 2: The training loss (left) and the accuracy (right) of models using different automatic speech-to-text ratio tuning defined in \ref{tab:improve_interdomain}.
  • Figure 3: The training loss and accuracy of models using supervised ratio decay and different automatic speech-to-text ratio tuning defined in \ref{tab:improve_interdomain}.
  • Figure 4: WERs on the Common Voice (Finnish and Greek) test set against model generations.