Speech Separation for Hearing-Impaired Children in the Classroom

Feyisayo Olalere, Kiki van der Heijden, H. Christiaan Stronks, Jeroen Briaire, Johan H. M. Frijns, Yagmur Güçlütürk

TL;DR

This work tackles speech separation for hearing-impaired children in classrooms, where noise, reverberation, and moving talkers degrade intelligibility and adult-trained models often fail to generalize to children's speech. It applies a binaural, real-time MIMO-TasNet framework, evaluated on simulated classroom scenes with spatialization based on binaural room impulse responses (BRIRs), two moving talkers, and a direction-of-arrival (DoA) estimator to assess spatial cue preservation. Key findings show that binaural cues substantially reduce the adult-to-child mismatch in clean conditions, but child babble necessitates domain-specific classroom data; importantly, fine-tuning an adult-trained model with only half the classroom data achieves comparable or superior performance while preserving spatial localization. The results support data-efficient adaptation strategies for on-device hearing aids and cochlear implants and highlight the practical potential of spatially aware speech separation in ecologically valid educational environments. Perceptual validation and more complex multi-talker tests remain as future work.

Abstract

Classroom environments are particularly challenging for children with hearing impairments, where background noise, multiple talkers, and reverberation degrade speech perception. These difficulties are greater for children than adults, yet most deep learning speech separation models for assistive devices are developed using adult voices in simplified, low-reverberation conditions. This overlooks both the higher spectral similarity of children's voices, which weakens separation cues, and the acoustic complexity of real classrooms. We address this gap using MIMO-TasNet, a compact, low-latency, multi-channel architecture suited for real-time deployment in bilateral hearing aids or cochlear implants. We simulated naturalistic classroom scenes with moving child-child and child-adult talker pairs under varying noise and distance conditions. Training strategies tested how well the model adapts to children's speech through spatial cues. Models trained on adult speech, classroom data, and finetuned variants were compared to assess data-efficient adaptation. Results show that adult-trained models perform well in clean scenes, but classroom-specific training greatly improves separation quality. Finetuning with only half the classroom data achieved comparable gains, confirming efficient transfer learning. Training with diffuse babble noise further enhanced robustness, and the model preserved spatial awareness while generalizing to unseen distances. These findings demonstrate that spatially aware architectures combined with targeted adaptation can improve speech accessibility for children in noisy classrooms, supporting future on-device assistive technologies.

Paper Structure

This paper contains 27 sections, 2 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Simulation pipeline illustrating the generation of reverberant and spatialized speech mixtures for classroom conditions. The process includes simulating room and listener acoustic properties (A), modeling talkers’ movement trajectories (B), and synthesizing classroom speech mixtures (C). The numbers (1)-(5) correspond to the steps itemized in Section \ref{sec:simulation_overall}.
  • Figure 2: Spectrogram of utterances from an adult speaker (left) and a child speaker (right). The child’s speech shows higher fundamental frequency (F0) and greater pitch variation (visible in the more widely spaced and fluctuating harmonic structure). In contrast, the adult’s speech has a lower F0 and denser harmonic spacing, with energy more evenly distributed across a wider frequency range. The green dashed lines indicate estimated F0.
  • Figure 3: Overall architecture of the proposed binaural speech separation and enhancement system \cite{han2021binaural}. The model processes left and right ear mixed signals ($y_L, y_R$) to output separated ($\hat{S}1_{L,R}$, $\hat{S}2_{L,R}$) and subsequently enhanced speech signals ($\overline{\hat{S}}1_{L,R}$, $\overline{\hat{S}}2_{L,R}$) for two sources. To evaluate whether spatial cues are preserved after enhancement, each enhanced binaural signal is further passed through a dedicated DoA estimation module, which predicts the estimated speaker trajectories ($\hat{\mathcal{T}}$). The DoA estimator is used exclusively for evaluation and is trained independently, ensuring it does not influence the separation or enhancement stages.
  • Figure 4: Speech separation performance of MIMO models trained on different datasets and evaluated on classroom scenes under varying noise conditions. Bar plots show the mean signal-to-noise ratio improvement (SNRi) for the MIMO model trained on each data split (Adult, Class, and Finetuned) and evaluated on the Class dataset with either Adult–Child (blue) or Child–Child (orange) talker pairs. (A) Performance of models trained and evaluated in babble-free reverberant classroom conditions. (B) Performance of models trained and evaluated with background babble in reverberant classroom conditions. Error bars indicate the standard error of the mean (SEM). Asterisks (***) denote statistically significant differences across conditions within each model ($p < 0.001$, Mann-Whitney U test).
  • Figure 5: SNR performance of three models under clean and noisy evaluation conditions. (A) Models trained without babble noise. (B) Models trained with babble noise. All models were evaluated on classroom data with and without background babble. Teal bars represent evaluations without babble; yellow bars represent evaluations with babble. Error bars indicate the standard error of the mean (SEM). Asterisks (***) denote statistically significant differences ($p < 0.001$, Mann-Whitney U test).
  • ...and 1 more figures
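
As a rough illustration of the metrics named in the captions of Figures 4 and 5, the following Python sketch computes an SNR improvement (SNRi) for one separated signal, and the mean with standard error of the mean (SEM) over several scores. The SNR definition used here (reference energy over residual energy, in dB) is a common convention and is an assumption; the paper's exact formulation is not given in this section. The function names `snr_db`, `snr_improvement_db`, and `mean_and_sem`, and the toy sine-wave signals, are hypothetical and chosen only for illustration.

```python
import numpy as np


def snr_db(reference, estimate):
    """SNR in dB of `estimate` against a clean `reference` signal.

    Assumed definition: 10 * log10(||reference||^2 / ||reference - estimate||^2).
    """
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    residual = reference - estimate
    eps = 1e-12  # guards against division by zero for a perfect estimate
    return 10.0 * np.log10((np.sum(reference ** 2) + eps) / (np.sum(residual ** 2) + eps))


def snr_improvement_db(reference, mixture, estimate):
    """SNRi: SNR of the separated output minus SNR of the unprocessed mixture."""
    return snr_db(reference, estimate) - snr_db(reference, mixture)


def mean_and_sem(values):
    """Mean and standard error of the mean (sample std / sqrt(n))."""
    values = np.asarray(values, dtype=float)
    return values.mean(), values.std(ddof=1) / np.sqrt(values.size)


# Toy example (hypothetical signals): a target tone plus an equally loud interferer.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
target = np.sin(2 * np.pi * 220 * t)
interferer = np.sin(2 * np.pi * 330 * t)
mixture = target + interferer            # input SNR is about 0 dB
estimate = target + 0.1 * interferer     # a separator that attenuates the interferer 10x

snri = snr_improvement_db(target, mixture, estimate)
mean_score, sem_score = mean_and_sem([1.0, 2.0, 3.0, 4.0])
```

In this toy case the interferer amplitude drops by a factor of ten, so `snri` comes out at approximately 20 dB. For a binaural model, one plausible choice (again an assumption) would be to compute the score per ear and average the two values before aggregating across test mixtures.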