Learning When to Trust Which Teacher for Weakly Supervised ASR

Aakriti Agrawal, Milind Rao, Anit Kumar Sahu, Gopinath Chennupati, Andreas Stolcke

TL;DR

The paper addresses training ASR models under weak supervision from multiple opaque expert teachers. It introduces a Smart-Weighter, a gating mechanism that weights each expert's transcript per utterance based on the input audio. A student RNN-T is trained on unlabeled audio using these weighted transcripts from several domain experts, without access to the teachers' internal parameters. Experiments on LibriSpeech and LibriLight show 4–25% improvements over the Best-Expert, All-Experts, and ROVER baselines, with additional gains when ASR entropy is incorporated as side information. The method supports practical continual adaptation from unlabeled data and diverse, possibly heterogeneous experts, and it stays computationally efficient by using a gating mechanism rather than a full mixture-of-experts architecture.

Abstract

Automatic speech recognition (ASR) training can utilize multiple experts as teacher models, each trained on a specific domain or accent. Teacher models may be opaque in nature since their architecture may not be known or their training cadence may differ from that of the student ASR model. Still, the student models are updated incrementally using the pseudo-labels generated independently by the expert teachers. In this paper, we exploit supervision from multiple domain experts in training student ASR models. This training strategy is especially useful in scenarios where few or no human transcriptions are available. To that end, we propose a Smart-Weighter mechanism that selects an appropriate expert based on the input audio, and then trains the student model in an unsupervised setting. We show the efficacy of our approach using LibriSpeech and LibriLight benchmarks and find an improvement of 4 to 25% over baselines that uniformly weight all the experts, use a single expert model, or combine experts using ROVER.
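
The weighting idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the student's loss for an utterance is a weighted sum of its losses against each expert's transcript, and that the weights come from a softmax over per-expert scores. The function names, the softmax normalization, and all numeric values are hypothetical and chosen only for illustration. The uniform and one-hot weightings stand in for the baselines that weight all experts equally or use a single expert.

```python
import math


def softmax(scores):
    """Turn raw per-expert scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_student_loss(expert_losses, weights):
    """Combine the student's per-expert losses for one utterance."""
    if len(expert_losses) != len(weights):
        raise ValueError("need exactly one weight per expert")
    return sum(w * l for w, l in zip(weights, expert_losses))


# Hypothetical numbers for a single utterance with three experts.
# expert_losses[i]: student loss against expert i's transcript (assumed).
expert_losses = [2.0, 0.5, 1.5]
# gate_scores[i]: score a weighting network might assign to expert i (assumed).
gate_scores = [0.1, 2.0, 0.3]

smart_weights = softmax(gate_scores)   # input-dependent weighting
uniform_weights = [1.0 / 3.0] * 3      # all experts weighted equally
single_expert = [0.0, 1.0, 0.0]        # one fixed expert only

for name, w in [("smart", smart_weights),
                ("uniform", uniform_weights),
                ("single", single_expert)]:
    print(name, round(weighted_student_loss(expert_losses, w), 4))
```

In this sketch only the source of the weights changes between the three settings. The loss combination itself is identical, which is why a lightweight gate can replace fixed weighting without changing the student's training loop.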

Paper Structure (14 sections, 3 figures, 6 tables)

Figures (3)

  • Figure 1: Visualization of training or updating a student model given unlabeled audio. For a given utterance, we have teacher transcripts from multiple opaque experts of differing quality. A Smart-Weighter (W-network) consumes expert transcriptions and utterance audio to weight their quality, with a larger weight given to experts deemed to be more accurate. The student model is trained using semi-supervised learning with audio and paired expert transcriptions using the determined weights.
  • Figure 2: The Smart-Weighter consists of a speech encoder that produces features from an utterance audio and a BERT language model that produces features from expert transcriptions. A transformer-decoder model consumes the BERT features while cross-attending to audio features. The outputs are processed to determine the weights of the expert models.
  • Figure 3: Speaker cluster assignments for expert ASR training based on random assignment (left) and speaker-embedding-based clusters (right).