ChildMandarin: A Comprehensive Mandarin Speech Dataset for Young Children Aged 3-5

Jiaming Zhou, Shiyao Wang, Shiwan Zhao, Jiabei He, Haoqin Sun, Hui Wang, Cheng Liu, Aobo Kong, Yujie Guo, Xi Yang, Yequan Wang, Yonghua Lin, Yong Qin

TL;DR

ChildMandarin is a Mandarin speech dataset for children aged 3–5 that addresses a critical resource gap: 41.25 hours of speech from 397 speakers across 22 provinces, with character-level transcriptions. The work benchmarks ASR with models trained from scratch (Transformer, Conformer, Paraformer) and with pre-trained models (Wav2vec 2.0, HuBERT, CW, Whisper). Conformer with CTC-AED and attention rescoring yields the best CER (about 27.38%), while the pre-trained baselines improve markedly with fine-tuning. The work also evaluates speaker verification (SV) with VoxCeleb-pretrained embeddings (x-vector, ECAPA-TDNN, ResNet-TDNN): SV is feasible but is limited by the small dataset size and the variability of young voices, and the larger models overfit. Overall, ChildMandarin is a valuable open-source resource for Mandarin child speech research. It supports more robust ASR and SV for young children, informs ethical data practices, and motivates future methods such as LoRA-based fine-tuning.

Abstract

Automatic speech recognition (ASR) systems have advanced significantly with models like Whisper, Conformer, and self-supervised frameworks such as Wav2vec 2.0 and HuBERT. However, developing robust ASR models for young children's speech remains challenging due to differences in pronunciation, tone, and pace compared to adult speech. In this paper, we introduce a new Mandarin speech dataset focused on children aged 3 to 5, addressing the scarcity of resources in this area. The dataset comprises 41.25 hours of speech with carefully crafted manual transcriptions, collected from 397 speakers across various provinces in China, with balanced gender representation. We provide a comprehensive analysis of speaker demographics, speech duration distribution and geographic coverage. Additionally, we evaluate ASR performance on models trained from scratch, such as Conformer, as well as fine-tuned pre-trained models like HuBERT and Whisper, where fine-tuning demonstrates significant performance improvements. Furthermore, we assess speaker verification (SV) on our dataset, showing that, despite the challenges posed by the unique vocal characteristics of young children, the dataset effectively supports both ASR and SV tasks. This dataset is a valuable contribution to Mandarin child speech research. The dataset is now open-source and freely available for all academic purposes on https://github.com/flageval-baai/ChildMandarin.
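
The ASR results summarized above are reported as character error rate (CER). As a rough illustration of how such a number is typically computed, the sketch below divides the total character-level edit distance by the total number of reference characters. The function names `edit_distance` and `cer`, the whitespace stripping, and the example sentences are assumptions made for illustration; the paper's own scoring script and text normalization may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference character
                curr[j - 1] + 1,     # insertion of a hypothesis character
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Corpus-level CER in percent: total edits / total reference characters."""
    total_edits = 0
    total_chars = 0
    for ref, hyp in zip(references, hypotheses):
        # Assumption: spaces carry no meaning in Mandarin transcripts.
        ref_chars = list(ref.replace(" ", ""))
        hyp_chars = list(hyp.replace(" ", ""))
        total_edits += edit_distance(ref_chars, hyp_chars)
        total_chars += len(ref_chars)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    return 100.0 * total_edits / total_chars


# Hypothetical example: one substituted character out of five.
print(cer(["我想吃苹果"], ["我想吃平果"]))  # 20.0
```

Pooling edits over the whole corpus, rather than averaging per-utterance rates, keeps very short utterances from dominating the score. Short utterances are common in speech from young children, so the choice matters here.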

Paper Structure (24 sections, 1 equation, 6 figures, 12 tables)

This paper contains 24 sections, 1 equation, 6 figures, 12 tables.

Figures (6)

  • Figure 1: Distribution of speakers by age and gender in our dataset
  • Figure 2: Utterance-level and speaker-level duration distribution in our dataset
  • Figure 3: Geographic distribution of speakers in our dataset
  • Figure 4: Proportions of accents and recording devices in our dataset
  • Figure 5: CER (%) comparison of zero-shot and fine-tuning methods using CW model across different age-gender groups
  • ...and 1 more figure