
RealClass: A Framework for Classroom Speech Simulation with Public Datasets and Game Engines

Ahmed Adel Attia, Jing Liu, Carol Espy-Wilson

TL;DR

RealClass addresses the lack of large-scale, public classroom speech data by synthesizing realistic classroom acoustics and noise in the Unity game engine, and by pairing child speech from public corpora with instructional adult speech to form a clean base. The authors generate a 391-hour dataset, create a diverse 50-hour classroom noise corpus, and build a classroom room impulse response (RIR) bank from simulated exponential sine sweep (ESS) measurements, which makes data creation efficient and scalable. Validation on classroom ASR benchmarks shows that RealClass closely approximates real classroom speech and, when combined with a limited amount of real classroom data or with continued pretraining (CPT), yields substantial word error rate (WER) improvements. The work demonstrates that synthetic classroom data can substitute for or augment real data, and the planned public release may accelerate the development of robust classroom speech technologies for education.
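The TL;DR mentions building the RIR bank from ESS simulations. As a generic illustration of that measurement technique (Farina's exponential sine sweep method), here is a minimal pure-Python sketch: play a sweep, record it, then convolve the recording with a time-reversed, amplitude-compensated copy of the sweep to recover the impulse response. This is not the authors' Unity pipeline. The sweep range, the 2 kHz sample rate (chosen only to keep pure-Python convolution fast), the function names `exponential_sine_sweep`, `inverse_filter` and `estimate_rir`, and the two-tap toy "room" are all hypothetical. In a game-engine setting, `recorded` would come from a simulated microphone rather than from a direct convolution.

```python
import math


def exponential_sine_sweep(f1, f2, duration, fs):
    """Exponential sine sweep whose instantaneous frequency rises from f1 to f2 Hz."""
    n = int(duration * fs)
    rate = math.log(f2 / f1)  # R = ln(f2 / f1)
    return [
        math.sin(2 * math.pi * f1 * duration / rate * (math.exp((i / fs) * rate / duration) - 1))
        for i in range(n)
    ]


def inverse_filter(sweep, f1, f2):
    """Time-reversed sweep with an envelope that falls by 6 dB per octave toward low
    frequencies, compensating for the sweep spending more time (energy) there."""
    n = len(sweep)
    rate = math.log(f2 / f1)
    return [sweep[n - 1 - i] * math.exp(-i * rate / n) for i in range(n)]


def convolve(a, b):
    """Direct full linear convolution; zero samples of `a` are skipped."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        if ai == 0.0:
            continue
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def estimate_rir(recorded, inv, rir_len):
    """Deconvolve a recorded sweep; return an RIR aligned and normalized to its direct-path peak."""
    full = convolve(recorded, inv)
    peak = max(range(len(full)), key=lambda k: abs(full[k]))
    return [v / full[peak] for v in full[peak:peak + rir_len]]


# Hypothetical toy "room": direct path plus one echo at half amplitude, 150 samples later.
fs, f1, f2 = 2000, 50.0, 900.0
sweep = exponential_sine_sweep(f1, f2, 0.5, fs)
inv = inverse_filter(sweep, f1, f2)
true_rir = [0.0] * 300
true_rir[0], true_rir[150] = 1.0, 0.5
recorded = convolve(true_rir, sweep)  # stands in for the simulated microphone signal
rir = estimate_rir(recorded, inv, len(true_rir))
print(f"direct: {rir[0]:.2f}, echo at 150 samples: {rir[150]:.2f}")
```

The recovered response is band-limited to roughly f1 to f2, so the echo appears as a short pulse of about half the direct-path height rather than an exact single sample. Practical measurements would use a higher sample rate and FFT-based convolution.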

Abstract

The scarcity of large-scale classroom speech data has hindered the development of AI-driven speech models for education. Classroom datasets remain limited and are not publicly available, and the absence of dedicated classroom noise or Room Impulse Response (RIR) corpora prevents the use of standard data augmentation techniques. In this paper, we introduce a scalable methodology for synthesizing classroom noise and RIRs using game engines, a versatile framework that can be extended to domains beyond the classroom. Building on this methodology, we present RealClass, a dataset that combines a synthesized classroom noise corpus with a classroom speech dataset compiled from publicly available corpora. The speech data pairs a children's speech corpus with instructional speech extracted from YouTube videos to approximate real classroom interactions in clean conditions. Experiments on clean and noisy speech show that RealClass closely approximates real classroom speech, making it a valuable asset in the absence of abundant real classroom speech.
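The abstract notes that, without classroom noise or RIR corpora, standard data augmentation cannot be applied. For readers unfamiliar with that augmentation, a minimal pure-Python sketch of the usual recipe follows: convolve clean speech with an RIR to add reverberation, then add a noise clip scaled to a target signal-to-noise ratio (SNR). The function names, the 10 dB target, the sine-tone "speech", the Gaussian "noise" and the two-tap RIR are hypothetical stand-ins. The paper's actual mixing parameters are not given in this summary.

```python
import math
import random


def reverberate(speech, rir):
    """Convolve clean speech with an RIR, keeping the original length.
    Loops over non-zero RIR taps, which is fast for sparse toy RIRs."""
    out = [0.0] * len(speech)
    for j, h in enumerate(rir):
        if h == 0.0:
            continue
        for i in range(len(speech) - j):
            out[i + j] += h * speech[i]
    return out


def power(x):
    """Mean power of a signal."""
    return sum(v * v for v in x) / len(x)


def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so that 10*log10(P_speech / P_noise) equals snr_db, then add it."""
    if len(noise) < len(speech):
        raise ValueError("noise clip is shorter than the speech")
    noise = noise[:len(speech)]
    scale = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)], scale


# Toy stand-ins: a 1 s tone for speech, Gaussian noise for a classroom-noise clip.
fs = 16000
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 220 * n / fs) for n in range(fs)]
toy_rir = [1.0] + [0.0] * 799 + [0.3]  # hypothetical: direct path + one echo 50 ms later
noise_clip = [rng.gauss(0.0, 1.0) for _ in range(fs)]

reverberant = reverberate(clean, toy_rir)
noisy, scale = add_noise_at_snr(reverberant, noise_clip, snr_db=10.0)
achieved = 10 * math.log10(power(reverberant) / power([scale * n for n in noise_clip]))
print(f"target 10.0 dB, achieved {achieved:.2f} dB")
```

Scaling the noise rather than the speech keeps the speech level fixed across SNR conditions. With long, dense RIRs, FFT-based convolution is the usual choice over this direct loop.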

Paper Structure

This paper contains 15 sections and 2 tables.