Table of Contents
Fetching ...

Automated evaluation of children's speech fluency for low-resource languages

Bowen Zhang, Nur Afiqah Abdul Latiff, Justin Kan, Rong Tong, Donny Soh, Xiaoxiao Miao, Ian McLoughlin

TL;DR

This work tackles automatic assessment of children's speech fluency in low-resource languages by integrating a fine-tuned multilingual ASR with an objective metrics extractor and a GPT-based meta-evaluator. The approach employs data augmentation and LoRA-based ASR fine-tuning to adapt to Malay and Tamil, then uses WER/CER/PER, pause metrics, and speech rate as inputs to a GPT model that predicts fluency. GPT-based meta-evaluation outperforms traditional ML baselines and a multimodal GPT, achieving high accuracy, particularly for Malay. The findings demonstrate a scalable pathway for automated fluency assessment in very low-resource languages and potential applicability to other mother-tongue contexts.

Abstract

Assessment of children's speaking fluency in education is well researched for majority languages, but remains highly challenging for low resource languages. This paper proposes a system to automatically assess fluency by combining a fine-tuned multilingual ASR model, an objective metrics extraction stage, and a generative pre-trained transformer (GPT) network. The objective metrics include phonetic and word error rates, speech rate, and speech-pause duration ratio. These are interpreted by a GPT-based classifier guided by a small set of human-evaluated ground truth examples, to score fluency. We evaluate the proposed system on a dataset of children's speech in two low-resource languages, Tamil and Malay and compare the classification performance against Random Forest and XGBoost, as well as using ChatGPT-4o to predict fluency directly from speech input. Results demonstrate that the proposed approach achieves significantly higher accuracy than multimodal GPT or other methods.

Automated evaluation of children's speech fluency for low-resource languages

TL;DR

This work tackles automatic assessment of children's speech fluency in low-resource languages by integrating a fine-tuned multilingual ASR with an objective metrics extractor and a GPT-based meta-evaluator. The approach employs data augmentation and LoRA-based ASR fine-tuning to adapt to Malay and Tamil, then uses WER/CER/PER, pause metrics, and speech rate as inputs to a GPT model that predicts fluency. GPT-based meta-evaluation outperforms traditional ML baselines and a multimodal GPT, achieving high accuracy, particularly for Malay. The findings demonstrate a scalable pathway for automated fluency assessment in very low-resource languages and potential applicability to other mother-tongue contexts.

Abstract

Assessment of children's speaking fluency in education is well researched for majority languages, but remains highly challenging for low resource languages. This paper proposes a system to automatically assess fluency by combining a fine-tuned multilingual ASR model, an objective metrics extraction stage, and a generative pre-trained transformer (GPT) network. The objective metrics include phonetic and word error rates, speech rate, and speech-pause duration ratio. These are interpreted by a GPT-based classifier guided by a small set of human-evaluated ground truth examples, to score fluency. We evaluate the proposed system on a dataset of children's speech in two low-resource languages, Tamil and Malay and compare the classification performance against Random Forest and XGBoost, as well as using ChatGPT-4o to predict fluency directly from speech input. Results demonstrate that the proposed approach achieves significantly higher accuracy than multimodal GPT or other methods.

Paper Structure

This paper contains 15 sections, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Proposed automatic fluency scoring framework showing adaptation of pre-trained ASR and GPT models using highly augmented low-resource mother tongue (MT) language data.
  • Figure 2: Automatic scoring using tuned GPTs. The fixed "system content" defines the task, the context and the input/output formats. "user content" contains per-instance input data.