TICL+: A Case Study On Speech In-Context Learning for Children's Speech Recognition

Haolong Zheng, Yekaterina Yegorova, Mark Hasegawa-Johnson

TL;DR

The paper addresses children's speech recognition, a low-resource task marked by high acoustic and linguistic variability and limited labeled data. It builds on Speech In-Context Learning (SICL) by adding an acoustic reranking step to the retrieval-based TICL method, producing TICL+. TICL+ selects in-context demonstrations using both the semantic similarity of transcripts and the acoustic similarity of audio, improving transcription without fine-tuning. Experiments on four children's speech corpora show relative WER reductions of up to 53.3% over zero-shot decoding and up to 37.6% over TICL, highlighting the value of combining semantic and acoustic cues for robust, scalable child ASR.
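
The percentages above are relative reductions, meaning the drop in WER divided by the baseline WER. A minimal sketch of that arithmetic follows; the function name and the WER values are hypothetical and chosen only for illustration, they are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, system_wer: float) -> float:
    """Return the relative WER reduction of a system over a baseline.

    Both arguments are word error rates on the same scale (e.g. percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - system_wer) / baseline_wer


# Hypothetical values for illustration only (not taken from the paper):
# a baseline WER of 30.0% falling to 14.01% is a 53.3% relative reduction.
example = relative_wer_reduction(30.0, 14.01)
print(f"{example:.1%}")
```

A relative reduction says nothing about the absolute WER, so the same percentage can correspond to very different absolute gains on different corpora.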

Abstract

Children's speech recognition remains challenging due to substantial acoustic and linguistic variability, limited labeled data, and significant differences from adult speech. Speech foundation models can address these challenges through Speech In-Context Learning (SICL), allowing adaptation to new domains without fine-tuning. However, the effectiveness of SICL depends on how in-context examples are selected. We extend an existing retrieval-based method, Text-Embedding KNN for SICL (TICL), introducing an acoustic reranking step to create TICL+. This extension prioritizes examples that are both semantically and acoustically aligned with the test input. Experiments on four children's speech corpora show that TICL+ achieves up to a 53.3% relative word error rate reduction over zero-shot performance and 37.6% over baseline TICL, highlighting the value of combining semantic and acoustic information for robust, scalable ASR in children's speech.
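
The retrieve-then-rerank idea described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes text and acoustic embeddings have already been computed as NumPy arrays, uses cosine similarity for both stages, and reranks the text-based shortlist purely by acoustic similarity. The function names, the `shortlist_size` and `k` parameters, and the toy vectors are all hypothetical.

```python
import numpy as np


def cosine_sim(query: np.ndarray, pool: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and each row of pool."""
    query = query / np.linalg.norm(query)
    pool = pool / np.linalg.norm(pool, axis=1, keepdims=True)
    return pool @ query


def select_demonstrations(q_text, q_audio, pool_text, pool_audio,
                          shortlist_size=3, k=2):
    """Pick k in-context examples: text KNN first, then acoustic rerank.

    Assumption: the rerank uses acoustic similarity alone; the paper may
    combine the two scores differently.
    """
    # Stage 1 (TICL-style): shortlist the nearest candidates by text embedding.
    text_scores = cosine_sim(q_text, pool_text)
    shortlist = np.argsort(-text_scores)[:shortlist_size]
    # Stage 2 (the "+" step): reorder the shortlist by acoustic similarity.
    audio_scores = cosine_sim(q_audio, pool_audio[shortlist])
    reranked = shortlist[np.argsort(-audio_scores)]
    return reranked[:k].tolist()


# Toy embeddings (hypothetical, 2-dimensional for readability).
pool_text = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.2]])
pool_audio = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [0.7, 0.7]])
q_text = np.array([1.0, 0.0])
q_audio = np.array([1.0, 0.0])

picked = select_demonstrations(q_text, q_audio, pool_text, pool_audio)
print(picked)  # [1, 3]
```

In this toy case candidate 2 matches the query audio perfectly but is never selected, because its transcript embedding is far from the query and it does not survive the text-based shortlist. Candidate 0 has the closest transcript but drops out after reranking because its audio is dissimilar.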

Paper Structure

This paper contains 8 sections, 7 equations, 1 figure, 1 table.

Figures (1)

  • Figure 1: Overview of the TICL+ pipeline