K-Function: Joint Pronunciation Transcription and Feedback for Evaluating Kids Language Function

Shuhe Li, Chenxu Guo, Jiachen Lian, Cheol Jun Cho, Wenshuo Zhao, Xiner Xu, Ruiyu Jin, Xiaoyu Shi, Xuanru Zhou, Dingkun Zhou, Sam Wang, Grace Wang, Jingze Yang, Jingyi Xu, Ruohan Bao, Xingrui Chen, Elise Brenner, Brandon In, Francesca Pei, Maria Luisa Gorno-Tempini, Gopala Anumanchipalli

TL;DR

The paper tackles automatic, fine-grained assessment of young children's language, where word-level metrics miss sub-word errors and recognizers struggle with high-pitched voices and sparse data. It introduces K-Function, a three-stage pipeline that combines K-WFST phoneme transcription with an LLM-based scoring module to produce objective language-function scores and feedback. K-WFST uses a phoneme similarity matrix and adaptive K-selection to transcribe child speech robustly, and the paper reports state-of-the-art phoneme error rates on the MyST and Multitudes datasets. The LLM scoring agrees strongly with human proctors when it is given high-quality transcripts. The results indicate that precise phoneme recognition substantially improves downstream scoring and enables scalable screening and intervention planning in educational settings.

Abstract

Evaluating young children's language is challenging for automatic speech recognizers due to high-pitched voices, prolonged sounds, and limited data. We introduce K-Function, a framework that combines accurate sub-word transcription with objective, Large Language Model (LLM)-driven scoring. Its core, the Kids-Weighted Finite State Transducer (K-WFST), merges an acoustic phoneme encoder with a phoneme-similarity model to capture child-specific speech errors while remaining fully interpretable. K-WFST achieves a 1.39% phoneme error rate on MyST and 8.61% on Multitudes, absolute improvements of 10.47% and 7.06%, respectively, over a greedy-search decoder. These high-quality transcripts are used by an LLM to grade verbal skills, developmental milestones, reading, and comprehension, with results that align closely with human evaluators. Our findings show that precise phoneme recognition is essential for creating an effective assessment framework, enabling scalable language screening for children.
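
To make the reported metric concrete, here is a minimal Python sketch of how a phoneme error rate (PER) is commonly computed: the Levenshtein edit distance between reference and hypothesis phoneme sequences, divided by the reference length. It is an illustration under assumptions, not the paper's evaluation code. The function names and the example phoneme sequences are hypothetical, and `implied_baseline` assumes that "absolute improvement" means a difference in percentage points.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions cost 1 each)."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # aligning i reference phonemes with an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference phoneme
                curr[j - 1] + 1,         # insertion of a hypothesis phoneme
                prev[j - 1] + (r != h),  # match (cost 0) or substitution (cost 1)
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """PER in percent: edit distance normalized by the reference length."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def implied_baseline(per: float, absolute_improvement: float) -> float:
    """Baseline PER implied by a reported PER and an absolute improvement.

    Assumption: the improvement is a percentage-point difference.
    """
    return round(per + absolute_improvement, 2)


# Hypothetical example: a child says "wabbit" for "rabbit" (one substitution).
reference = ["R", "AE", "B", "IH", "T"]
hypothesis = ["W", "AE", "B", "IH", "T"]
print(phoneme_error_rate(reference, hypothesis))

# Greedy-decoder PERs implied by the abstract's numbers, under the assumption above.
print(implied_baseline(1.39, 10.47), implied_baseline(8.61, 7.06))
```

Under that reading, the abstract's figures would correspond to greedy-search baselines of roughly 11.86% on MyST and 15.67% on Multitudes. These values are derived from the abstract, not quoted from the paper.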

Paper Structure

This paper contains 13 sections, 1 equation, 1 figure, 3 tables, 1 algorithm.

Figures (1)

  • Figure 1: The 3-stage pipeline of the K-Function framework, from audio input to a comprehensive feedback report.