Enhancing Child Vocalization Classification with Phonetically-Tuned Embeddings for Assisting Autism Diagnosis

Jialu Li, Mark Hasegawa-Johnson, Karrie Karahalios

TL;DR

This work presents a unified Wav2Vec 2.0-based framework for simultaneous speaker diarization (SD) and child vocalization classification (VC) in autism assessment, enhanced by phonetically-tuned embeddings from a child phoneme recognizer trained on IPA/SAMPA data. The approach uses child phonetics either as auxiliary input features or as an auxiliary CTC task with pseudo transcripts, which improves child VC while maintaining strong SD performance on two corpora, Rapid-ABC (RABC) and BabbleCor. The method builds on pre-training with 4300 hours of child-centered audio and reports substantial reductions in diarization error rate (DER) along with VC gains, achieving state-of-the-art results on the reproducible BabbleCor subset. A key limitation is the absence of autism diagnoses in the RABC data, but the work lays groundwork for applying phonetically-informed self-supervised representations to early autism screening as more labeled data become available.

Abstract

The assessment of children at risk of autism typically involves a clinician observing, taking notes, and rating children's behaviors. A machine learning model that can label adult and child audio could save much of the labor of coding children's behaviors, helping clinicians capture critical events and communicate better with parents. In this study, we leverage Wav2Vec 2.0 (W2V2), pre-trained on 4300 hours of home audio of children under 5 years old, to build a unified system for clinician-child speaker diarization and vocalization classification (VC). To enhance children's VC, we build a W2V2 phoneme recognition system for children under 4 years old, and we either incorporate its phonetically-tuned embeddings as auxiliary features or recognize pseudo phonetic transcripts as an auxiliary task. We test our method on two corpora (Rapid-ABC and BabbleCor) and obtain consistent improvements. Additionally, we outperform the state-of-the-art performance on the reproducible subset of BabbleCor. Code is available at https://huggingface.co/lijialudew

Paper Structure

This paper contains 13 sections, 1 figure, 3 tables.

Figures (1)

  • Figure 1: (a) W2V2 model architecture combining adult and child audio, with or without auxiliary W2V2-Children's PR (W2V2-PR) features or the auxiliary W2V2-Children's PR task. Linear = linear layer, MP = mean pooling, WA = weighted average, and FFN = feed-forward network. ADU VC, CHI VC, and CHI PR denote the adult VC, child VC, and auxiliary child PR tiers, respectively. (b) Illustration of the four combination modules. The symbol "$+$" denotes summation and "$\bigoplus$" denotes concatenation. For combinations $C1$ and $C3$, $\alpha_i+\beta_i=1$ for $i\in\{1,3\}$. (c) Explanation of the feature dimension letters.
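The caption names two ways of merging feature streams: a weighted summation whose weights satisfy $\alpha_i+\beta_i=1$, and concatenation. The following is a minimal sketch of those two operations on frame-level features, not the authors' implementation. The function names, the plain-list feature representation, and the sigmoid parameterization that enforces the weight constraint are all assumptions made for illustration; the caption does not say how the weights are learned or what distinguishes the four modules beyond these operators.

```python
import math
from typing import List, Tuple

# T frames, each a D-dimensional feature vector (hypothetical representation).
Frames = List[List[float]]


def convex_weights(raw_weight: float) -> Tuple[float, float]:
    """Map an unconstrained scalar to (alpha, beta) with alpha + beta = 1.

    Assumption: a sigmoid is used here only as one simple way to satisfy
    the constraint stated in the caption.
    """
    alpha = 1.0 / (1.0 + math.exp(-raw_weight))
    return alpha, 1.0 - alpha


def weighted_sum(main: Frames, aux: Frames, raw_weight: float = 0.0) -> Frames:
    """Summation-style combination: alpha * main + beta * aux, frame by frame."""
    if len(main) != len(aux):
        raise ValueError("both streams need the same number of frames")
    alpha, beta = convex_weights(raw_weight)
    combined = []
    for main_frame, aux_frame in zip(main, aux):
        if len(main_frame) != len(aux_frame):
            raise ValueError("summation needs matching feature dimensions")
        combined.append([alpha * m + beta * a for m, a in zip(main_frame, aux_frame)])
    return combined


def concatenate(main: Frames, aux: Frames) -> Frames:
    """Concatenation-style combination: feature dimensions are stacked per frame."""
    if len(main) != len(aux):
        raise ValueError("both streams need the same number of frames")
    return [main_frame + aux_frame for main_frame, aux_frame in zip(main, aux)]


if __name__ == "__main__":
    w2v2_frames = [[1.0, 2.0], [3.0, 4.0]]   # stand-in for W2V2 features
    pr_frames = [[3.0, 4.0], [5.0, 6.0]]     # stand-in for W2V2-PR features
    print(weighted_sum(w2v2_frames, pr_frames))  # equal weights when raw_weight = 0
    print(concatenate(w2v2_frames, pr_frames))
```

Summation keeps the feature dimension unchanged but requires both streams to share it, whereas concatenation accepts differing dimensions at the cost of a wider input to the next layer.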