Enhancing Child Vocalization Classification with Phonetically-Tuned Embeddings for Assisting Autism Diagnosis
Jialu Li, Mark Hasegawa-Johnson, Karrie Karahalios
TL;DR
This work presents a unified Wav2Vec 2.0–based framework for joint speaker diarization (SD) and child vocalization classification (VC) in autism assessment, enhanced by phonetically-tuned embeddings from a child phoneme recognizer trained on IPA/SAMPA data. Child phonetic information is used either as auxiliary input features or through an auxiliary CTC task on pseudo transcripts, which improves child VC while maintaining strong SD performance on two corpora, Rapid-ABC (RABC) and BabbleCor. The model is pre-trained on 4300 hours of child-centered audio and shows substantial reductions in diarization error rate (DER) along with VC gains, achieving state-of-the-art results on the reproducible subset of BabbleCor. A key limitation is the absence of autism diagnoses in the RABC data, but the work lays groundwork for applying phonetically-informed self-supervised representations to early autism screening as more labeled data become available.
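For readers unfamiliar with DER, the following is a minimal sketch of a frame-level version of the metric: missed speech, false alarms, and speaker confusions divided by the number of reference speech frames. It is a simplification and not the scoring setup used in this work. It assumes fixed role labels (so no speaker mapping step is needed), no overlapping speech, and no collar around boundaries. The function name `frame_level_der` and the labels `"child"`, `"adult"`, and `"sil"` are hypothetical.

```python
def frame_level_der(reference, hypothesis, silence="sil"):
    """Simplified frame-level diarization error rate.

    Assumes one label per frame, labels already mapped between reference
    and hypothesis, no overlapping speech, and no forgiveness collar.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same number of frames")

    missed = false_alarm = confusion = speech = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref != silence:
            speech += 1
            if hyp == silence:
                missed += 1          # speech labeled as silence
            elif hyp != ref:
                confusion += 1       # speech attributed to the wrong speaker
        elif hyp != silence:
            false_alarm += 1         # silence labeled as speech

    if speech == 0:
        raise ValueError("reference contains no speech frames")
    return (missed + false_alarm + confusion) / speech


# Toy example: 6 reference speech frames, with one confusion,
# one missed frame, and one false alarm.
reference = ["sil", "child", "child", "adult", "adult", "sil", "child", "child"]
hypothesis = ["sil", "child", "adult", "adult", "sil", "adult", "child", "child"]
der = frame_level_der(reference, hypothesis)
print(der)  # 0.5
```

Lower values are better, and because false alarms count as errors the value can exceed 1.0.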
Abstract
The assessment of children at risk of autism typically involves a clinician observing, taking notes, and rating children's behaviors. A machine learning model that can label adult and child audio could save much of the labor involved in coding children's behaviors, helping clinicians capture critical events and communicate better with parents. In this study, we leverage Wav2Vec 2.0 (W2V2), pre-trained on 4300 hours of home audio of children under 5 years old, to build a unified system for clinician-child speaker diarization and vocalization classification (VC). To enhance children's VC, we build a W2V2 phoneme recognition system for children under 4 years old, and we either incorporate its phonetically-tuned embeddings as auxiliary features or recognize pseudo phonetic transcripts as an auxiliary task. We test our method on two corpora (Rapid-ABC and BabbleCor) and obtain consistent improvements. Additionally, we outperform the state of the art on the reproducible subset of BabbleCor. Code is available at https://huggingface.co/lijialudew
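The two ways of using child phonetic information can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. The abstract does not state how the embeddings are combined or how the auxiliary loss is weighted, so frame-wise concatenation and a weighted sum of losses are assumptions made here for illustration. The names `fuse_frames`, `joint_loss`, and `aux_weight` are hypothetical, and the toy lists stand in for real W2V2 outputs.

```python
def fuse_frames(base_frames, phonetic_frames):
    """Option 1 (assumed): use phonetic embeddings as auxiliary features by
    concatenating them with the base embeddings, frame by frame."""
    if len(base_frames) != len(phonetic_frames):
        raise ValueError("both sequences must have the same number of frames")
    return [list(b) + list(p) for b, p in zip(base_frames, phonetic_frames)]


def joint_loss(main_loss, aux_loss, aux_weight=0.5):
    """Option 2 (assumed): train with an auxiliary task by adding a weighted
    auxiliary loss (e.g. phoneme recognition on pseudo transcripts) to the
    main loss. The weight value is an illustrative placeholder."""
    return main_loss + aux_weight * aux_loss


# Toy stand-ins: 3 frames, 4-dim base embeddings, 2-dim phonetic embeddings.
base = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8], [0.9, 1.0, 1.1, 1.2]]
phonetic = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

fused = fuse_frames(base, phonetic)
print(len(fused), len(fused[0]))  # 3 6

total = joint_loss(main_loss=1.0, aux_loss=2.0, aux_weight=0.5)
print(total)  # 2.0
```

In a real system the fused frames would feed the classification layers, and the two losses would come from the classifier and a CTC head respectively.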
