A Novel Fusion Architecture for PD Detection Using Semi-Supervised Speech Embeddings

Tariq Adnan; Abdelrahman Abdelkader; Zipei Liu; Ekram Hossain; Sooyong Park; MD Saiful Islam; Ehsan Hoque

TL;DR

This work presents a framework for detecting Parkinson's disease from speech, using recordings of participants uttering an English pangram. The recordings were collected through a web application in diverse settings and environments, including participants' homes. The proposed fusion model outperforms standard concatenation-based fusion models and other baselines.

Abstract

We present a framework to recognize Parkinson's disease (PD) from speech, using recordings of an English pangram utterance collected through a web application in diverse recording settings and environments, including participants' homes. Our dataset includes a global cohort of 1306 participants, 392 of whom were diagnosed with PD. Leveraging the diversity of the dataset, which spans various demographic properties (such as age, sex, and ethnicity), we used deep learning embeddings derived from semi-supervised models such as Wav2Vec 2.0, WavLM, and ImageBind to represent the speech dynamics associated with PD. Our novel fusion model for PD classification, which aligns different speech embeddings into a cohesive feature space, demonstrated superior performance over standard concatenation-based fusion models and other baselines (including models built on traditional acoustic features). In a randomized data split configuration, the model achieved an Area Under the Receiver Operating Characteristic Curve (AUROC) of 88.94% and an accuracy of 85.65%. Rigorous statistical analysis confirmed that our model performs equitably across demographic subgroups defined by sex, ethnicity, and age, and remains robust regardless of disease duration. Furthermore, when tested on two entirely unseen test datasets collected from clinical settings and from a PD care center, our model maintained AUROC scores of 82.12% and 78.44%, respectively. This affirms the model's robustness and its potential to enhance accessibility and health equity in real-world applications.

Paper Structure (19 sections, 6 figures, 8 tables)

Introduction
Results
Dataset
Feature Extraction
Performance Evaluation on Standard Train-Validation-Test Split
Generalizability Test on External Datasets
Error Analysis
Ablation Studies
Discussion
Methods
Dataset Description
Digital Speech Feature Extraction
Feature Pre-processing
Baseline Modeling
Fusion Modeling
...and 4 more sections

Figures (6)

Figure 1: Our proposed framework for a fusion-based PD classifier using deep embeddings from WavLM and ImageBind. First, the speech is separated from the video recordings. Then the segment of the audio file in which the participant utters the pangram is extracted. Vector embeddings of the speech data are extracted from the last layers of WavLM and ImageBind. The WavLM features are then projected into the space of the ImageBind feature set. Finally, the projected features are fused and passed through a classification layer that labels the participant as PD or control. Note that the image of the person is AI-generated.
Figure 2: Performance evaluation of PD classifiers from speech in a random split configuration. (a) and (b) respectively demonstrate the AUROC curve and the confusion matrix of our best performing novel fusion model which projects WavLM features into the feature space of ImageBind features.
Figure 3: Performance evaluation of PD classifiers from speech on external test sets. (a) and (c) respectively show the AUROC curve and the confusion matrix of our best performing novel fusion model when tested on the dataset collected from the PD care facility. In contrast, (b) and (d) show the same visualizations when the model was tested on the cohort from the clinical setup.
Figure 4: Decision tree maps of error rates and error coverage among demographic cohorts. (a) and (b) respectively show the notable nodes/cohorts with relatively high error rates and their error coverage percentage. The two numbers within each tree node represent the misclassified count and the total count of individuals in that specific cohort (e.g., $26/183$ indicates that $26$ out of $183$ individuals were misclassified). Labels on the branches (e.g., age $\leq 68.50$) represent the decision boundary condition used to split the child subtrees.
Figure 5: Heat maps of error rates and error coverage among demographic cohorts. A stronger blue color indicates that a cohort has a higher error rate, while red indicates that the cohort is empty. Within each cohort, the upper-row numbers represent the misclassified count out of the total count (i.e., incorrect/total), and the bottom-row number represents the error rate as a percentage.
...and 1 more figures
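
The pipeline described in the Figure 1 caption (project the WavLM embedding into the ImageBind feature space, fuse, then classify) can be illustrated with a minimal forward-pass sketch. This is not the authors' implementation. The toy embedding sizes, the random untrained weights, the use of concatenation as the fusion step, and every function name below are assumptions made only for illustration; the captions shown here do not specify the fusion operation or the layer sizes.

```python
import math
import random

# Toy dimensions, chosen only to keep the example small (assumptions,
# not the real WavLM or ImageBind embedding sizes).
WAVLM_DIM = 8
IMAGEBIND_DIM = 6


def init_linear(rng, in_dim, out_dim):
    """Create random weights and zero biases for one linear layer."""
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def linear(x, weights, bias):
    """Compute y = W x + b for a single input vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fusion_forward(wavlm_vec, imagebind_vec, proj, clf):
    """Return a PD probability for one recording (hypothetical helper)."""
    # Step 1: project WavLM features into the ImageBind feature space.
    projected = linear(wavlm_vec, *proj)
    # Step 2: fuse. Here the two vectors are joined end to end
    # (list concatenation, not element-wise addition); this choice
    # is an assumption.
    fused = projected + imagebind_vec
    # Step 3: a single-output classification layer followed by a sigmoid.
    logit = linear(fused, *clf)[0]
    return sigmoid(logit)


rng = random.Random(0)
proj = init_linear(rng, WAVLM_DIM, IMAGEBIND_DIM)
clf = init_linear(rng, 2 * IMAGEBIND_DIM, 1)

# Stand-ins for embeddings that would come from the pretrained models.
wavlm_vec = [rng.gauss(0.0, 1.0) for _ in range(WAVLM_DIM)]
imagebind_vec = [rng.gauss(0.0, 1.0) for _ in range(IMAGEBIND_DIM)]

probability = fusion_forward(wavlm_vec, imagebind_vec, proj, clf)
print(f"Predicted PD probability (untrained weights): {probability:.3f}")
```

In a real system the projection and classification weights would be learned jointly from labeled recordings, and the input vectors would be the last-layer embeddings extracted from the pangram segment.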
