Exploring Speech Foundation Models for Speaker Diarization in Child-Adult Dyadic Interactions
Anfeng Xu, Kevin Huang, Tiantian Feng, Lue Shen, Helen Tager-Flusberg, Shrikanth Narayanan
TL;DR
This work evaluates speech foundation models for child-adult speaker diarization, reframing diarization as frame-level classification and benchmarking nine pre-trained models on a dataset of ASD child-parent interactions. The authors show that foundation-model-based diarization, particularly with Whisper variants, substantially outperforms traditional baselines, achieving relative reductions of up to 39.5% in Diarization Error Rate (DER) and 62.3% in Speaker Confusion Rate (SC). The approach remains robust across speaker demographics and is data-efficient: strong performance is achievable with roughly 2 hours of fine-tuning data. The study offers practical guidance on input window size, demographics, and data efficiency, highlighting the potential of foundation-model representations to advance low-resource child speech understanding and related behavioral analyses.
Abstract
Speech foundation models, trained on vast datasets, have opened unique opportunities for addressing challenging low-resource speech understanding tasks, such as those involving child speech. In this work, we explore the capabilities of speech foundation models on child-adult speaker diarization. We show that exemplary foundation models can achieve 39.5% and 62.3% relative reductions in Diarization Error Rate and Speaker Confusion Rate, respectively, compared to previous speaker diarization methods. In addition, we benchmark and evaluate the speaker diarization results of the speech foundation models across varying input audio window sizes, speaker demographics, and training data ratios. Our results highlight promising pathways for understanding and adopting speech foundation models to facilitate child speech understanding.
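
To make the reported metrics concrete, the following is a minimal sketch of how Diarization Error Rate and Speaker Confusion Rate could be computed from frame-level labels, and how a relative reduction is derived from two scores. It is not the authors' evaluation code. The three-class label set (`SILENCE`, `CHILD`, `ADULT`), the absence of overlapped speech, the lack of a scoring collar, and the function names are all assumptions made for illustration.

```python
from typing import Dict, Sequence

# Hypothetical frame labels; the paper's exact label set is not given here.
SILENCE, CHILD, ADULT = 0, 1, 2


def frame_level_rates(reference: Sequence[int], hypothesis: Sequence[int]) -> Dict[str, float]:
    """Compute DER and speaker confusion (SC) rate from per-frame labels.

    Assumes one label per frame, no overlapped speech, and no collar.
    Both rates are normalized by the number of reference speech frames.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")

    speech = missed = false_alarm = confusion = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref != SILENCE:
            speech += 1
            if hyp == SILENCE:
                missed += 1        # speech frame predicted as silence
            elif hyp != ref:
                confusion += 1     # child/adult swapped
        elif hyp != SILENCE:
            false_alarm += 1       # silence frame predicted as speech

    if speech == 0:
        raise ValueError("reference contains no speech frames")

    return {
        "der": (missed + false_alarm + confusion) / speech,
        "sc": confusion / speech,
    }


def relative_reduction(baseline: float, new: float) -> float:
    """Relative reduction of an error rate with respect to a baseline."""
    return (baseline - new) / baseline


# Toy example with made-up labels (not data from the paper).
reference = [SILENCE, CHILD, CHILD, ADULT, ADULT, SILENCE, CHILD, ADULT]
hypothesis = [SILENCE, CHILD, ADULT, ADULT, SILENCE, CHILD, CHILD, ADULT]
rates = frame_level_rates(reference, hypothesis)
```

Under this definition, a "39.5% relative reduction in DER" means `relative_reduction(baseline_der, new_der)` equals 0.395, which is different from an absolute drop of 39.5 percentage points.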
