Exploring Speech Foundation Models for Speaker Diarization in Child-Adult Dyadic Interactions

Anfeng Xu, Kevin Huang, Tiantian Feng, Lue Shen, Helen Tager-Flusberg, Shrikanth Narayanan

TL;DR

This work evaluates speech foundation models for child-adult speaker diarization, framing diarization as frame-level classification and benchmarking nine pre-trained models on a dataset of child-parent interactions involving children with ASD. Foundation-model-based diarization, particularly with Whisper variants, substantially outperforms traditional baselines, with relative reductions of up to 39.5% in Diarization Error Rate (DER) and 62.3% in Speaker Confusion Rate. The approach remains robust across speaker demographics and is data-efficient: strong performance is achievable with roughly 2 hours of fine-tuning data. The study also offers practical guidance on input window size, demographics, and data efficiency, highlighting the potential of foundation-model representations to advance low-resource child speech understanding and related behavioral analyses.

Abstract

Speech foundation models, trained on vast datasets, have opened unique opportunities in addressing challenging low-resource speech understanding tasks, such as child speech. In this work, we explore the capabilities of speech foundation models on child-adult speaker diarization. We show that exemplary foundation models can achieve 39.5% and 62.3% relative reductions in Diarization Error Rate and Speaker Confusion Rate, respectively, compared to previous speaker diarization methods. In addition, we benchmark and evaluate the speaker diarization results of the speech foundation models while varying the input audio window size, speaker demographics, and training data ratio. Our results highlight promising pathways for understanding and adopting speech foundation models to facilitate child speech understanding.
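
To make the reported metrics concrete, the following is a minimal sketch of how Diarization Error Rate, Speaker Confusion Rate, and a relative reduction could be computed from frame-level labels. It is an illustration under stated assumptions, not the authors' evaluation code: the three-class label encoding (silence, child, adult), the absence of overlapped speech, and the function names `frame_level_errors` and `relative_reduction` are all hypothetical. It uses the standard definition of DER as missed speech plus false alarm plus speaker confusion, divided by total reference speech.

```python
# Hypothetical label encoding for frame-level classification (an assumption,
# not taken from the paper): one label per fixed-length frame.
SILENCE, CHILD, ADULT = 0, 1, 2


def frame_level_errors(reference, hypothesis):
    """Compute DER and its components from two equal-length label sequences.

    All rates are normalised by the number of reference speech frames.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")

    missed = false_alarm = confusion = ref_speech = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref != SILENCE:
            ref_speech += 1
            if hyp == SILENCE:
                missed += 1          # speech labelled as silence
            elif hyp != ref:
                confusion += 1       # child labelled as adult, or vice versa
        elif hyp != SILENCE:
            false_alarm += 1         # silence labelled as speech

    if ref_speech == 0:
        raise ValueError("reference contains no speech frames")

    return {
        "missed": missed / ref_speech,
        "false_alarm": false_alarm / ref_speech,
        "speaker_confusion": confusion / ref_speech,
        "der": (missed + false_alarm + confusion) / ref_speech,
    }


def relative_reduction(baseline, improved):
    """Relative reduction of an error rate with respect to a baseline."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Toy sequences, invented purely for illustration.
    ref = [SILENCE, CHILD, CHILD, ADULT, ADULT, ADULT, SILENCE, CHILD]
    hyp = [SILENCE, CHILD, ADULT, ADULT, ADULT, SILENCE, CHILD, CHILD]
    print(frame_level_errors(ref, hyp))
```

Under this reading, a "39.5% relative reduction in DER" means that `relative_reduction(baseline_der, model_der)` is about 0.395, not that the DER dropped by 39.5 percentage points.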

Paper Structure (23 sections, 3 figures, 4 tables)

This paper contains 23 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Spoken language assessment pipeline.
  • Figure 2: Comparisons of DER among different demographics (Gender and Language Level). The foundation model used in this experiment is Whisper-Small.
  • Figure 3: Comparisons of DER at different training data ratios. The foundation model used in this experiment is Whisper-Small.