Continued Pretraining for Domain Adaptation of Wav2vec2.0 in Automatic Speech Recognition for Elementary Math Classroom Settings

Ahmed Adel Attia; Dorottya Demszky; Tolulope Ogunremi; Jing Liu; Carol Espy-Wilson

TL;DR

The paper addresses the challenge of robust ASR in classroom settings by evaluating continued pretraining (CPT) to adapt Wav2vec2.0 to noisy, multi-speaker classroom data. By pretraining on unlabeled classroom audio from several starting checkpoints and finetuning with labeled data, CPT significantly reduces WER and improves generalization across noise, microphone configurations, and demographics, outperforming non-CPT baselines and a comparable Whisper model in many cases. It also analyzes demographics-related biases, introducing race-aware deanonymization for LM training to help mitigate disparities. The work demonstrates the practical value of CPT for domain adaptation in low-resource, high-variability environments and outlines future expansions, including larger unlabeled datasets and bias-focused tooling.

Abstract

Creating Automatic Speech Recognition (ASR) systems that are robust and resilient to classroom conditions is paramount to the development of AI tools to aid teachers and students. In this work, we study the efficacy of continued pretraining (CPT) in adapting Wav2vec2.0 to the classroom domain. We show that CPT is a powerful tool in that regard, reducing the Word Error Rate (WER) of Wav2vec2.0-based models by upwards of 10%. More specifically, CPT improves the model's robustness to different noises, microphones, and classroom conditions, as well as to classroom demographics. Our CPT models also generalize better to demographics unseen in the labeled finetuning data.
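Since the results are reported as Word Error Rate, a minimal sketch of the standard metric may help: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and an ASR hypothesis, divided by the number of reference words. The function name and the example sentences are illustrative assumptions and are not taken from the paper, and the sketch applies none of the text normalization (casing, punctuation, number formatting) that an actual evaluation would likely use.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substitution ("three" -> "tree") and one
# deletion ("five") against a five-word reference.
wer = word_error_rate("two plus three equals five", "two plus tree equals")
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many inserted words.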

Paper Structure (28 sections, 1 figure, 6 tables)

Introduction
Challenges of Children ASR
Challenges of Classroom environments
Data Scarcity
How Continued Pretraining Can Help
Background: Wav2vec2.0
Related Previous Works
Robust Wav2vec2.0
Noise-Robust Wav2vec2.0
Adaptation of Wav2vec2.0 to low-resource languages through CPT
Datasets
Audio Datasets
NCTE
In house dataset
NCTE-Text
...and 13 more sections

Figures (1)

Figure 1: Wav2vec2.0 pretraining architecture. Lowercase q in circles represents the quantization networks, with the green circle representing the positive sample and the red circles representing the negative samples. Adapted from Baevski et al. (2020).
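The positive and negative samples in the caption refer to the contrastive part of the Wav2vec2.0 pretraining objective: for a masked time step, the context vector should be closer to the true quantized target than to distractors drawn from other time steps. The following is a rough sketch of that single term, assuming cosine similarity and a temperature as in the original Wav2vec2.0 formulation. The function names and toy vectors are illustrative assumptions, and the other terms of the full pretraining objective are left out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def contrastive_loss(context, positive, negatives, temperature=0.1):
    """Negative log-probability of picking the positive among all candidates."""
    candidates = [positive] + list(negatives)
    logits = [cosine(context, q) / temperature for q in candidates]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


# Toy 2-D example: the context vector matches the positive exactly,
# so the loss is close to zero.
loss_aligned = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
# Here the distractor matches the context instead, so the loss is large.
loss_confused = contrastive_loss([1.0, 0.0], [-1.0, 0.0], [[1.0, 0.0]])
```

The loss shrinks as the context vector aligns with the positive sample and grows when a negative sample is the closer match, which is the behavior continued pretraining relies on when it is run on unlabeled in-domain audio.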
