Enhancing Child Vocalization Classification with Phonetically-Tuned Embeddings for Assisting Autism Diagnosis

Jialu Li; Mark Hasegawa-Johnson; Karrie Karahalios

TL;DR

This work presents a unified Wav2Vec 2.0-based framework for joint speaker diarization (SD) and child vocalization classification (VC) in autism assessment, enhanced by phonetically-tuned embeddings from a child phoneme recognizer trained on IPA/SAMPA data. Child phonetic information is incorporated either as auxiliary input features or as an auxiliary CTC task trained on pseudo transcripts; the approach improves child VC while maintaining strong SD performance on two corpora, Rapid-ABC (RABC) and BabbleCor. The method builds on a model pre-trained on 4300 hours of child-centered home audio, demonstrates substantial reductions in diarization error rate (DER) along with VC gains, and achieves state-of-the-art results on the reproducible BabbleCor subset. A key limitation is the absence of autism diagnoses in the RABC data, but the work lays groundwork for applying phonetically-informed self-supervised representations to early autism screening as more labeled data become available.

Abstract

The assessment of children at risk of autism typically involves a clinician observing, taking notes, and rating children's behaviors. A machine learning model that can label adult and child audio could substantially reduce the labor of coding children's behaviors, helping clinicians capture critical events and communicate better with parents. In this study, we leverage Wav2Vec 2.0 (W2V2), pre-trained on 4300 hours of home audio of children under 5 years old, to build a unified system for clinician-child speaker diarization and vocalization classification (VC). To enhance children's VC, we build a W2V2 phoneme recognition system for children under 4 years old, and we either incorporate its phonetically-tuned embeddings as auxiliary features or recognize pseudo phonetic transcripts as an auxiliary task. We test our method on two corpora (Rapid-ABC and BabbleCor) and obtain consistent improvements. Additionally, we outperform the state of the art on the reproducible subset of BabbleCor. Code is available at https://huggingface.co/lijialudew

Paper Structure (13 sections, 1 figure, 3 tables)

This paper contains 13 sections, 1 figure, 3 tables.

Introduction
Data
Methodology
Baseline W2V2 systems for child-adult SD and VC
Energy thresholding on two audio channels for SD
Learning phonetic embeddings using W2V2
Auxiliary W2V2 children's phonetics
Experimental Setup
Results
Baseline models
Auxiliary W2V2 phonetic information
BabbleCor
Conclusion & Future Work

Figures (1)

Figure 1: (a) W2V2 model architecture combining adult audio and child audio, with or without auxiliary W2V2-Children's PR (W2V2-PR) features or an auxiliary W2V2-Children's PR task. Linear = a linear layer, MP = mean pooling, WA = weighted average, and FFN = feed-forward network. ADU VC, CHI VC, and CHI PR denote the adult VC, child VC, and auxiliary child PR tiers, respectively. (b) Illustration of the four combination modules. The symbol "$+$" means summation and "$\bigoplus$" means concatenation. For combinations $C1$ and $C3$, $\alpha_i+\beta_i=1$ for $i\in\{1,3\}$. (c) Explanation of feature dimension letters.
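
The operations named in the caption (weighted average with weights summing to one, concatenation, and mean pooling) can be illustrated with a minimal plain-Python sketch. This is not the authors' implementation: the function names `convex_combine`, `concat_combine`, and `mean_pool` are hypothetical, the sigmoid parameterization of $\alpha$ is an assumption chosen only to enforce $\alpha+\beta=1$, and the caption does not specify which inputs each of the four modules combines.

```python
import math

def convex_combine(x, y, logit):
    """Frame-wise weighted sum alpha*x + beta*y with alpha + beta = 1.

    Assumption: alpha = sigmoid(logit), so the constraint holds for any
    real-valued (e.g. learnable) logit. x and y are lists of frames,
    each frame a list of floats with the same dimension.
    """
    alpha = 1.0 / (1.0 + math.exp(-logit))
    beta = 1.0 - alpha
    return [[alpha * a + beta * b for a, b in zip(fx, fy)]
            for fx, fy in zip(x, y)]

def concat_combine(x, y):
    """Frame-wise concatenation along the feature dimension."""
    return [fx + fy for fx, fy in zip(x, y)]

def mean_pool(frames):
    """Average over the time axis, giving one vector per segment."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]

# Toy features: 2 frames, 2 dimensions each (hypothetical values).
base = [[1.0, 2.0], [3.0, 4.0]]       # stands in for baseline W2V2 features
phonetic = [[3.0, 6.0], [5.0, 8.0]]   # stands in for W2V2-PR features

mixed = convex_combine(base, phonetic, logit=0.0)  # alpha = beta = 0.5
stacked = concat_combine(base, phonetic)
pooled = mean_pool(mixed)
```

With the weighted sum the feature dimension is unchanged, whereas concatenation doubles it, so any layer that follows a concatenation must accept the larger input size.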
