CPT-Boosted Wav2vec2.0: Towards Noise Robust Speech Recognition for Classroom Environments

Ahmed Adel Attia; Dorottya Demszky; Tolulope Ogunremi; Jing Liu; Carol Espy-Wilson

TL;DR

Classroom ASR faces significant challenges from children's speech and noisy environments. The authors demonstrate that continued pretraining (CPT) of Wav2vec2.0 on unlabeled classroom data substantially improves robustness and reduces word error rate (WER), outperforming several state-of-the-art baselines in many conditions. They provide a cross-dataset evaluation (NCTE and MPT) and show that the benefits of CPT persist across different starting checkpoints and are complemented by a lightweight 5-gram language model trained on NCTE-Text. The work establishes CPT as a strong, data-efficient approach for domain adaptation in low-resource, noisy settings and outlines guidelines and future directions for equitable, scalable classroom speech technologies.

Abstract

Creating Automatic Speech Recognition (ASR) systems that are robust and resilient to classroom conditions is paramount to the development of AI tools to aid teachers and students. In this work, we study the efficacy of continued pretraining (CPT) in adapting Wav2vec2.0 to the classroom domain. We show that CPT is a powerful tool in that regard and reduces the Word Error Rate (WER) of Wav2vec2.0-based models by upwards of 10%. More specifically, CPT improves the model's robustness to different noises, microphones and classroom conditions.
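Since the headline result is stated in terms of WER, a minimal sketch of how the metric is commonly computed may help: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the recognizer's hypothesis, divided by the number of reference words. The function name `word_error_rate`, the whitespace tokenization, and the example sentences are illustrative assumptions; this is not the authors' scoring code, and text normalization (casing, punctuation) is omitted.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokenizes on whitespace and applies no normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (reference word missed)
                curr[j - 1] + 1,                  # insertion (extra hypothesis word)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts: one of four reference words is dropped.
example_wer = word_error_rate("please open your books", "please open books")
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.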
