
Self-Supervised Models for Phoneme Recognition: Applications in Children's Speech for Reading Learning

Lucas Block Medin, Thomas Pellegrini, Lucile Gelin

TL;DR

The paper tackles phoneme recognition for French children learning to read using self-supervised models. Starting from models pretrained on English adult data and fine-tuned on a small in-house French child corpus, WavLM base+ with its transformer blocks unfrozen during fine-tuning outperforms a supervised Transformer+CTC baseline, achieving a phoneme error rate (PER) of 26.1% and showing stronger robustness to classroom noise. Incorporating the MyST data yields mixed results due to domain differences, while results on the MyST test set show that the model can generalize to related child speech in English. Overall, the approach highlights the value of large-scale SSL representations for low-resource child speech and their practical impact on reading tutoring systems.
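For readers unfamiliar with the metric, PER is conventionally computed as the Levenshtein (edit) distance between the reference and predicted phoneme sequences, summed over the test set and divided by the total number of reference phonemes. The sketch below illustrates that standard definition only; the function names and the toy phoneme sequences are hypothetical, and the paper's own scoring pipeline is not described here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of phoneme symbols."""
    # prev[j] holds the distance between the reference prefix processed so far
    # and the first j symbols of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs, hyps):
    """Corpus-level PER in percent: total edits / total reference phonemes."""
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_ref = sum(len(r) for r in refs)
    if total_ref == 0:
        raise ValueError("reference set contains no phonemes")
    return 100.0 * total_errors / total_ref


# Toy example (made-up symbols): one deleted phoneme out of five.
refs = [["b", "o", "Z", "u", "R"]]
hyps = [["b", "o", "Z", "u"]]
print(phoneme_error_rate(refs, hyps))
```

Because insertions count as errors, PER computed this way can exceed 100% when the hypothesis is much longer than the reference.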

Abstract

Child speech recognition is still an underdeveloped area of research due to the lack of data (especially for non-English languages) and the specific difficulties of this task. Having explored various architectures for child speech recognition in previous work, in this article we turn to recent self-supervised models. We first compare wav2vec 2.0, HuBERT and WavLM models adapted to phoneme recognition in French child speech, and continue our experiments with the best of them, WavLM base+. We then further adapt it by unfreezing its transformer blocks during fine-tuning on child speech, which greatly improves its performance and makes it significantly outperform our base model, a Transformer+CTC. Finally, we study in detail the behaviour of these two models under the real conditions of our application, and show that WavLM base+ is more robust to various reading tasks and noise levels.

Index Terms: speech recognition, child speech, self-supervised learning

Paper Structure

This paper contains 20 sections, 5 tables.