Children's Speech Recognition through Discrete Token Enhancement
Vrunda N. Sukhadia, Shammur Absar Chowdhury
TL;DR
The paper tackles the data scarcity and privacy challenges inherent in children's speech recognition by introducing end-to-end ASR that consumes discrete tokens derived from self-supervised learning (SSL) embeddings. It investigates single-view and multi-view codebooks to quantize frame-level SSL features (HuBERT-large-ll60k and WavLM-large) and feeds the discrete sequences into an encoder–decoder architecture with CTC/attention multitask learning. Results show that discrete-token ASR achieves performance close to continuous SSL baselines while drastically reducing model size (about 83% smaller than Whisper Small and 94% smaller than Whisper Medium) and improving privacy, with strong generalization to unseen domains and non-native speech. The work highlights the potential of privacy-preserving, compact discrete representations for child speech and suggests that multi-view SSL integration can further enhance robustness and accuracy.
Abstract
Children's speech recognition is considered a low-resource task, mainly due to the lack of publicly available data. There are several reasons for this scarcity, including expensive data collection and annotation processes and data privacy concerns, among others. Transforming speech signals into discrete tokens that do not carry sensitive information but still capture both linguistic and acoustic information could be a solution to the privacy concerns. In this study, we investigate using discrete speech tokens as input to children's speech recognition systems without significantly degrading ASR performance. Additionally, we explore single-view and multi-view strategies for creating these discrete labels. Furthermore, we test the models' generalization capabilities on datasets from an unseen domain and with non-native speakers. Results reveal that discrete-token ASR for children achieves nearly equivalent performance with an approximate 83% reduction in parameters.
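To make the idea of turning frame-level SSL features into discrete tokens more concrete, the following is a minimal sketch, not the authors' implementation. It assumes the codebook is learned with k-means (the text above only says "codebooks"), and it assumes one possible reading of "multi-view", namely concatenating time-aligned features from two SSL models before clustering. The function names, the toy dimensions, and the random arrays standing in for HuBERT and WavLM features are all hypothetical.

```python
import numpy as np


def quantize(features, codebook):
    """Map each frame (row) to the index of its nearest codebook entry."""
    # Squared Euclidean distance between every frame and every centroid: (T, K)
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def fit_codebook(features, num_codes, num_iters=20, seed=0):
    """Fit a k-means codebook (Lloyd's algorithm) on features of shape (T, D).

    ASSUMPTION: k-means is used here for illustration; the source does not
    name the clustering method.
    """
    rng = np.random.default_rng(seed)
    # Initialise centroids from randomly chosen, distinct frames.
    init_idx = rng.choice(len(features), size=num_codes, replace=False)
    codebook = features[init_idx].astype(np.float64)
    for _ in range(num_iters):
        tokens = quantize(features, codebook)
        for code in range(num_codes):
            members = features[tokens == code]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                codebook[code] = members.mean(axis=0)
    return codebook


def multi_view_features(view_a, view_b):
    """Concatenate two time-aligned feature views along the feature axis.

    ASSUMPTION: this is one plausible way to combine views, not necessarily
    the strategy used in the paper.
    """
    num_frames = min(len(view_a), len(view_b))
    return np.concatenate([view_a[:num_frames], view_b[:num_frames]], axis=1)


# Toy stand-ins for frame-level SSL features (real models emit far larger
# dimensions; 16 is used here only to keep the example fast).
rng = np.random.default_rng(0)
view_a = rng.normal(size=(200, 16))  # hypothetical "HuBERT" frames
view_b = rng.normal(size=(200, 16))  # hypothetical "WavLM" frames

# Single-view: one codebook over one model's features.
codebook_single = fit_codebook(view_a, num_codes=8)
tokens_single = quantize(view_a, codebook_single)

# Multi-view: one codebook over the concatenated features.
combined = multi_view_features(view_a, view_b)
codebook_multi = fit_codebook(combined, num_codes=8)
tokens_multi = quantize(combined, codebook_multi)
```

In this sketch, `tokens_single` and `tokens_multi` are integer sequences with one token per frame. Sequences of this kind, rather than the raw audio or continuous embeddings, would serve as the input to the ASR model.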
