KidSpeak: A General Multi-purpose LLM for Kids' Speech Recognition and Screening

Rohan Sharma, Dancheng Liu, Jingchen Sun, Shijie Zhou, Jiayu Qin, Jinjun Xiong, Changyou Chen

TL;DR

KidSpeak tackles the challenge of processing and diagnosing children's speech by introducing a multi-task speech LLM with a phonetic-informed encoder trained in two stages. It combines a Whisper-based audio encoder with dual decoders (orthographic and phonetic) and alignment losses, enabling accurate transcription and pathology classification. To overcome poor-quality pediatric data, it introduces FASA, a flexible forced-alignment tool that yields high-quality aligned datasets from noisy sources, significantly outperforming human annotation. Together, KidSpeak and FASA enable scalable, clinically relevant analysis for child speech disorders and support speech-language pathology workflows.

Abstract

With the rapid advancement of conversational and diffusion-based AI, AI is increasingly adopted in educational services, ranging from grading and assessment tools to personalized learning systems that provide targeted support for students. However, this adaptability has yet to fully extend to the domain of children's speech, where existing models often fail due to their reliance on datasets designed for clear, articulate adult speech. Children, particularly those in early developmental stages or with speech and language pathologies, present unique challenges that current AI models and datasets are ill-equipped to handle. To address this, we introduce KidSpeak, a multi-task speech-enhanced foundation model capable of both generative and discriminative tasks specifically tailored to children's speech patterns. Our framework employs a two-stage training process that incorporates phonetic knowledge into the speech encoder, achieving an average accuracy of 87% across four separate tasks. Furthermore, recognizing the limited scalability of human annotation and the shortcomings of existing speech alignment tools, we propose the Flexible and Automatic Speech Aligner (FASA) and use it to construct high-quality datasets for training and evaluation. This novel alignment tool significantly improves the quality of aligned children's speech from noisy data, enhancing data quality by 13.6x compared to human annotations, as demonstrated on the CHILDES dataset. To the best of our knowledge, KidSpeak and FASA represent the first comprehensive solution designed for speech and language therapy in children, offering both a multi-purpose speech LLM and a robust alignment tool.

Paper Structure

This paper contains 25 sections, 5 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Overview: We propose KidSpeak, a multi-purpose speech-based LLM aimed at diagnosis and transcription of kids' speech. The framework leverages a customized speech encoding procedure that incorporates phonetic information, enhancing downstream performance.
  • Figure 2: The Proposed Framework: KidSpeak uses phonetically informed speech features from the pre-trained multi-head Whisper encoder. During training, the features are concatenated with the text embeddings of the instructions, endowing the framework with spoken context and textual instruction through self-attention.
  • Figure 3: The speech embeddings are post-processed through a stacking mechanism (Top), ensuring adequate granularity. Thereafter, the stacked features are projected onto the feature space of the LLM, ensuring synchronization between the two modalities.
  • Figure 4: Instruction Template: We illustrate two instructions in the general input sequence that we implement for the IFT procedure. The conversation structure comprises alternating exchanges between a human user and KidSpeak, where the tags $\textcolor{mediumred}{<\mathtt{Aud}>}$ and $\textcolor{mediumred}{</\mathtt{Aud}>}$ demarcate the audio representations. The framework is trained to predict $\mathbf{Y}_{a_t}$ using the aural and instructional context. The $\textcolor{mediumred}{\mathtt{<STOP>}}$ token is set to ### in practice.
  • Figure 5: Multi-head Whisper: We employ two separate decoders to decode the same speech segment in English and its Phonetic counterpart. The decoders are further aligned using contrastive and cross-attentive mechanisms, synchronizing the procedure.
  • ...and 2 more figures
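
The stacking and projection steps described in the Figure 3 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the group size `k`, the zero-padding of trailing frames, the single linear projection, and the function names `stack_frames` and `project` are all assumptions made for illustration. The sketch only shows the general idea of concatenating consecutive encoder frames into longer vectors and mapping them into another feature space.

```python
from typing import List

Vector = List[float]


def stack_frames(frames: List[Vector], k: int) -> List[Vector]:
    """Concatenate every k consecutive frames into one longer vector.

    Assumption: trailing frames that do not fill a full group are
    zero-padded. The paper's actual handling is not stated here.
    """
    if k < 1:
        raise ValueError("k must be >= 1")
    if not frames:
        return []
    dim = len(frames[0])
    pad = (-len(frames)) % k  # number of zero frames needed to fill the last group
    padded = frames + [[0.0] * dim for _ in range(pad)]
    stacked = []
    for start in range(0, len(padded), k):
        group = padded[start:start + k]
        # Flatten the k frames of this group into a single vector.
        stacked.append([x for frame in group for x in frame])
    return stacked


def project(vectors: List[Vector], weight: List[Vector], bias: Vector) -> List[Vector]:
    """Apply y = W x + b to each vector; weight has shape (out_dim, in_dim)."""
    return [
        [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]
        for vec in vectors
    ]


# Toy example: three 2-dimensional frames, stacked in groups of two.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
stacked = stack_frames(frames, k=2)

# Hypothetical projection from the stacked dimension (4) to a target dimension (2).
weight = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
bias = [0.5, -0.5]
projected = project(stacked, weight, bias)
```

Stacking shortens the sequence by a factor of `k` while widening each vector by the same factor, so the projection's input dimension must equal `k` times the frame dimension.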