Speech Recognition for Automatically Assessing Afrikaans and isiXhosa Preschool Oral Narratives

Christiaan Jacobs, Annelien Smith, Daleen Klop, Ondřej Klejch, Febe de Wet, Herman Kamper

TL;DR

This study develops automatic speech recognition for spontaneous Afrikaans and isiXhosa preschool narratives, with the goal of automating oral narrative assessment in a low-resource setting. Starting from Whisper, the authors evaluate a range of strategies (LoRA fine-tuning, in- and out-of-domain adult data, voice conversion, semi-supervised learning, and their combinations) using only 5 minutes of transcribed in-domain child speech drawn from MAIN-based recordings of 4–5 year-olds. In-domain adult speech provides the largest gains, semi-supervised learning improves both languages, and voice conversion helps primarily when applied to in-domain data; LoRA helps on Afrikaans but not on isiXhosa. A combined system using multiple strategies gives the best development performance, with notable test-set gains, although a gap remains relative to fully supervised toplines. The work validates a broad set of child-speech ASR strategies in an under-explored linguistic setting and informs future development of end-to-end narrative scoring pipelines.

Abstract

We develop automatic speech recognition (ASR) systems for stories told by Afrikaans and isiXhosa preschool children. Oral narratives provide a way to assess children's language development before they learn to read. We consider a range of prior child-speech ASR strategies to determine which is best suited to this unique setting. Using Whisper and only 5 minutes of transcribed in-domain child speech, we find that additional in-domain adult data (adult speech matching the story domain) provides the biggest improvement, especially when coupled with voice conversion. Semi-supervised learning also helps for both languages, while parameter-efficient fine-tuning helps on Afrikaans but not on isiXhosa (which is under-represented in the Whisper model). Few child-speech studies look at non-English data, and even fewer at the preschool ages of 4 and 5. Our work therefore represents a unique validation of a wide range of previous child-speech ASR strategies in an under-explored setting.

Paper Structure (11 sections, 1 figure, 6 tables)