From Weak Labels to Strong Results: Utilizing 5,000 Hours of Noisy Classroom Transcripts with Minimal Accurate Data

Ahmed Adel Attia, Dorottya Demszky, Jing Liu, Carol Espy-Wilson

TL;DR

The paper addresses the challenge of low-resource classroom ASR where gold-standard transcriptions are scarce and expensive. It introduces Weakly Supervised Pretraining (WSP), a two-stage approach that pretrains on noisy weak transcripts and then fine-tunes on a small amount of gold data. Through synthetic corruption experiments using TEDLIUM-3 and a real-world classroom corpus (NCTE), WSP improves ASR performance beyond direct fine-tuning and self-training, demonstrating robustness to substantial label noise. The findings suggest that leveraging inexpensive weak supervision can substantially reduce labeling costs while delivering usable transcription accuracy in real-world classroom settings.

Abstract

Recent progress in speech recognition has relied on models trained on vast amounts of labeled data. However, classroom Automatic Speech Recognition (ASR) faces the real-world challenge of abundant weak transcripts paired with only a small amount of accurate, gold-standard data. In such low-resource settings, high transcription costs make re-transcription impractical. To address this, we ask: what is the best approach when abundant inexpensive weak transcripts coexist with limited gold-standard data, as is the case for classroom speech data? We propose Weakly Supervised Pretraining (WSP), a two-step process where models are first pretrained on weak transcripts in a supervised manner, and then fine-tuned on accurate data. Our results, based on both synthetic and real weak transcripts, show that WSP outperforms alternative methods, establishing it as an effective training methodology for low-resource ASR in real-world scenarios.
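To make the idea of a synthetic weak transcript concrete, the sketch below corrupts a clean transcript at the word level and measures the resulting noise as word error rate (WER). It is an illustration only: this page does not describe the corruption procedure actually used, so the function names, the uniform choice among deletion, substitution, and insertion, and the `noise_rate` parameter are all assumptions.

```python
import random


def corrupt_transcript(words, noise_rate, vocab, rng):
    """Return a noisy copy of `words` (hypothetical corruption scheme).

    Each word is corrupted with probability `noise_rate`; a corrupted word is
    deleted, replaced by a random vocabulary word, or followed by an extra
    random vocabulary word, chosen uniformly.
    """
    noisy = []
    for word in words:
        if rng.random() >= noise_rate:
            noisy.append(word)  # keep the word unchanged
            continue
        op = rng.choice(("delete", "substitute", "insert"))
        if op == "substitute":
            noisy.append(rng.choice(vocab))
        elif op == "insert":
            noisy.extend([word, rng.choice(vocab)])
        # "delete": append nothing
    return noisy


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / max(len(reference), 1)


# Example: simulate a weak transcript and measure how noisy it is.
gold = "please open your books to page twelve".split()
weak = corrupt_transcript(gold, 0.3, ["um", "okay", "so"], random.Random(0))
noise_level = word_error_rate(gold, weak)
```

Under these assumptions, the WER between the clean and corrupted transcripts gives a rough measure of label noise. A first training stage would then use pairs like `(audio, weak)`, and the fine-tuning stage would use the small gold set.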

Paper Structure

This paper contains 20 sections and 3 tables.