Improving child speech recognition with augmented child-like speech

Yuanyuan Zhang, Zhengjun Yue, Tanvina Patel, Odette Scharenborg

TL;DR

The paper tackles the limited availability of child speech for CSR by introducing child-to-child voice conversion (VC) as a data augmentation strategy, exploring both monolingual and cross-lingual (Dutch→German) VC. Using AGAIN-VC and speaker-similarity-based target selection, the study generates diverse, child-like speech and evaluates its impact on CSR with Conformer and Whisper models, including analyses of data quantity and quality. Key findings show cross-lingual child-to-child VC yields the strongest CSR gains, with two-fold augmentation often sufficient for fine-tuning scenarios and six-fold augmentation beneficial when training from scratch; even small amounts of high-quality VC data can match the best FT results. The results demonstrate data-efficient CSR improvements and highlight the value of speaker-accurate VC in expanding child speech datasets for low-resource languages and settings.

Abstract

State-of-the-art ASR systems perform suboptimally on child speech, and the scarcity of child speech limits the development of child speech recognition (CSR). Therefore, we studied child-to-child voice conversion (VC) from existing child speakers in the dataset and from additional (new) child speakers, via monolingual and cross-lingual (Dutch-to-German) VC, respectively. The results showed that cross-lingual child-to-child VC significantly improved child ASR performance. Experiments on the impact of the quantity of child-to-child cross-lingual VC-generated data on fine-tuning (FT) ASR models gave the best results with two-fold augmentation for our FT-Conformer and FT-Whisper models, which reduced WER by ~3% absolute compared to the baseline, and with six-fold augmentation for the model trained from scratch, which improved WER by 3.6% absolute. Moreover, using a small amount of "high-quality" VC-generated data achieved results similar to those of our best FT models.
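The abstract reports gains as absolute WER reductions, which are easy to confuse with relative ones. The following is a minimal sketch of how WER and the two kinds of reduction are commonly computed; the function names and the example numbers are illustrative assumptions and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def absolute_reduction(baseline_wer: float, new_wer: float) -> float:
    """Difference in WER, e.g. 0.40 -> 0.37 is 0.03 (3% absolute)."""
    return baseline_wer - new_wer


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Reduction as a fraction of the baseline WER."""
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers, chosen only to illustrate the two definitions.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
abs_gain = absolute_reduction(0.40, 0.37)
rel_gain = relative_reduction(0.40, 0.37)
```

With these made-up values, a 3% absolute reduction from a 40% baseline corresponds to a 7.5% relative reduction, which is why it matters that the abstract specifies "absolute".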

Paper Structure

This paper contains 14 sections, 1 figure, and 2 tables.

Figures (1)

  • Figure 1: (Left panels) Quantity experiments with the Base2 model trained from scratch (green), after fine-tuning (orange) and Whisper with fine-tuning (blue) on child + x-fold child speech generated by $VC_{cl}$. (Right panels) Quality experiments: Conformer Base2 and Whisper with FT on child speech + lowest 10%, 20%,...,90% WER speakers of the two-fold $VC_{cl}$ data. Dashed lines: results of both models on the two-fold $VC_{cl}$ data.
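
The quality experiments in the caption keep only the VC speakers with the lowest WER at a given percentage. A minimal sketch of that kind of selection is shown below; the function name, the per-speaker WER dictionary, the floor rounding, and the tie-breaking by speaker ID are all assumptions, since this section does not describe how the authors implemented the selection.

```python
def select_lowest_wer_speakers(speaker_wer: dict[str, float], percent: int) -> list[str]:
    """Return the IDs of the `percent`% of speakers with the lowest WER.

    Assumptions (not from the paper): the count is rounded down, at least
    one speaker is always kept, and ties are broken by speaker ID.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range 1..100")

    # Sort by WER first, then by ID so the result is deterministic.
    ranked = sorted(speaker_wer, key=lambda spk: (speaker_wer[spk], spk))
    # Integer arithmetic avoids floating-point surprises such as 0.3 * 10 > 3.
    keep = max(1, (percent * len(ranked)) // 100)
    return ranked[:keep]


# Hypothetical per-speaker WERs for ten VC-generated speakers.
example_wer = {
    "spk01": 0.31, "spk02": 0.18, "spk03": 0.45, "spk04": 0.22, "spk05": 0.39,
    "spk06": 0.27, "spk07": 0.52, "spk08": 0.15, "spk09": 0.34, "spk10": 0.29,
}
top_20_percent = select_lowest_wer_speakers(example_wer, 20)
```

Sweeping `percent` over 10, 20, ..., 90 would produce the nested subsets that a curve like the one in the right panels is plotted over.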