Speech Separation for Hearing-Impaired Children in the Classroom
Feyisayo Olalere, Kiki van der Heijden, H. Christiaan Stronks, Jeroen Briaire, Johan H. M. Frijns, Yagmur Güçlütürk
TL;DR
This work tackles speech separation for hearing-impaired children in classrooms, where noise, reverberation, and moving talkers degrade intelligibility and adult-trained models often fail to generalize to children's speech. It introduces a binaural, real-time MIMO-TasNet framework evaluated on simulated classroom scenes with two moving talkers spatialized using binaural room impulse responses (BRIRs), and uses a direction-of-arrival (DoA) estimator to assess spatial cue preservation. Key findings show that binaural cues substantially reduce the adult-to-child mismatch in clean conditions, but child babble necessitates domain-specific classroom data; importantly, fine-tuning an adult-trained model with only half the classroom data achieves comparable or superior performance while preserving spatial localization. The results support data-efficient adaptation strategies for on-device hearing aids and cochlear implants and highlight the practical potential of spatially aware speech separation in ecologically valid educational environments. Perceptual validation and more complex multi-talker tests remain as future work.
Abstract
Classroom environments are particularly challenging for children with hearing impairments because background noise, multiple talkers, and reverberation degrade speech perception. These difficulties are greater for children than adults, yet most deep learning speech separation models for assistive devices are developed using adult voices in simplified, low-reverberation conditions. This practice overlooks both the higher spectral similarity of children's voices, which weakens separation cues, and the acoustic complexity of real classrooms. We address this gap using MIMO-TasNet, a compact, low-latency, multi-channel architecture suited for real-time deployment in bilateral hearing aids or cochlear implants. We simulated naturalistic classroom scenes with moving child-child and child-adult talker pairs under varying noise and distance conditions. We tested training strategies to assess how well the model adapts to children's speech through spatial cues. Models trained on adult speech, models trained on classroom data, and fine-tuned variants were compared to assess data-efficient adaptation. Results show that adult-trained models perform well in clean scenes, but classroom-specific training greatly improves separation quality. Fine-tuning an adult-trained model with only half the classroom data achieved comparable gains, confirming efficient transfer learning. Training with diffuse babble noise further enhanced robustness, and the model preserved spatial awareness while generalizing to unseen distances. These findings demonstrate that spatially aware architectures combined with targeted adaptation can improve speech accessibility for children in noisy classrooms, supporting future on-device assistive technologies.
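As a rough illustration of the kind of scene simulation described above, the sketch below spatializes two mono talkers with two-channel BRIRs and adds babble-like noise at a chosen signal-to-noise ratio. It is not the authors' pipeline: the function names `spatialize` and `scale_to_snr`, the 16 kHz sample rate, the 5 dB SNR, and the random stand-ins for speech, BRIRs, and babble are all assumptions made for the example. The BRIRs are also static here, whereas moving talkers would require time-varying impulse responses.

```python
import numpy as np


def spatialize(mono, brir):
    """Convolve a mono signal with a two-channel BRIR.

    Returns an array of shape (2, len(mono) + taps - 1): left and right ear.
    """
    return np.stack([np.convolve(mono, brir[ch]) for ch in range(2)])


def scale_to_snr(speech, noise, snr_db):
    """Rescale noise so the speech-to-noise power ratio equals snr_db."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return gain * noise


rng = np.random.default_rng(0)
fs = 16000  # assumed sample rate (Hz)

# Hypothetical stand-ins: one second of white noise per talker instead of speech.
talker_a = rng.standard_normal(fs)
talker_b = rng.standard_normal(fs)

# Toy BRIRs: exponentially decaying noise, 2 ears x 800 taps, one per talker position.
decay = np.exp(-np.arange(800) / 100.0)
brir_a = rng.standard_normal((2, 800)) * decay
brir_b = rng.standard_normal((2, 800)) * decay

# Binaural two-talker scene plus babble-like noise at the target SNR.
speech = spatialize(talker_a, brir_a) + spatialize(talker_b, brir_b)
babble = scale_to_snr(speech, rng.standard_normal(speech.shape), snr_db=5.0)
mixture = speech + babble

achieved_snr_db = 10 * np.log10(np.mean(speech ** 2) / np.mean(babble ** 2))
```

A separation model would take `mixture` as its two-channel input and be trained to recover each spatialized talker. Keeping both ear channels in the targets is what allows spatial cues to be checked afterwards, for example with a DoA estimator.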
