Something from Nothing: Data Augmentation for Robust Severity Level Estimation of Dysarthric Speech

Jaesung Bae, Xiuwen Zheng, Minje Kim, Chang D. Yoo, Mark Hasegawa-Johnson

Abstract

Dysarthric speech quality assessment (DSQA) is critical for clinical diagnostics and inclusive speech technologies. However, subjective evaluation is costly and difficult to scale, and the scarcity of labeled data limits robust objective modeling. To address this, we propose a three-stage framework that leverages unlabeled dysarthric speech and large-scale typical speech datasets to scale training. A teacher model first generates pseudo-labels for unlabeled samples, followed by weakly supervised pretraining using a label-aware contrastive learning strategy that exposes the model to diverse speakers and acoustic conditions. The pretrained model is then fine-tuned for the downstream DSQA task. Experiments on five unseen datasets spanning multiple etiologies and languages demonstrate the robustness of our approach. Our Whisper-based baseline significantly outperforms state-of-the-art (SOTA) DSQA predictors such as SpICE, and the full framework achieves an average Spearman rank correlation coefficient (SRCC) of 0.761 across unseen test datasets.
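As a rough illustration of the evaluation metric named in the abstract, the sketch below computes a Spearman rank correlation between predicted and reference severity scores. It is a minimal NumPy version of the standard definition (Pearson correlation of average ranks), not the authors' evaluation code; the function names `average_ranks` and `srcc` and the example scores are hypothetical.

```python
import numpy as np

def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    sorted_vals = values[order]
    ranks = np.empty(len(values), dtype=float)
    i = 0
    while i < len(values):
        # Extend j to the end of the current run of tied values.
        j = i
        while j + 1 < len(values) and sorted_vals[j + 1] == sorted_vals[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def srcc(pred, target):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rp = average_ranks(pred)
    rt = average_ranks(target)
    rp = rp - rp.mean()
    rt = rt - rt.mean()
    return float((rp @ rt) / np.sqrt((rp @ rp) * (rt @ rt)))

# Hypothetical predicted vs. reference severity scores for five utterances.
predicted = [1.2, 2.9, 2.1, 4.3, 3.8]
reference = [1.0, 3.0, 2.0, 5.0, 4.0]
example_score = srcc(predicted, reference)
```

Because SRCC depends only on rank order, a predictor that orders utterances by severity correctly scores well even when its absolute scale differs from the reference ratings.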

Paper Structure (8 sections, 3 figures, 2 tables)

Figures (3)

  • Figure 3: t-SNE plots of the embeddings after stage 2 for various contrastive loss choices. We randomly select 1000 samples from the LibriSpeech and SAP training data and plot the embeddings taken right after temporal pooling. Blue crosses represent LibriSpeech data, and circles represent SAP data, colored from red (low severity) to green (high severity). Best viewed in color.
  • Figure 4: Percentage improvement in SRCC and PCC over the Baseline model for different values of $\tau$: (a) in-domain test set (SAP dataset) and (b) average scores over the cross-domain test sets. In general, the performance of our proposed methods improves as $\tau$ increases. Although SimCLR achieves the best cross-domain average performance at $\tau=10$, its in-domain test performance deteriorates significantly, highlighting the robustness of our proposed methods.
  • Figure 5: Embedding spaces with different $\tau$. Since the LibriSpeech and SAP datasets have distinct characteristics, they can be considered easy positive/negative pairs. Therefore, with a small $\tau$, contrastive learning tends not to associate pairs from the LibriSpeech and SAP datasets. In contrast, with a large $\tau$, they are better harmonized, enhancing the robustness of the downstream regression model.
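
The role of $\tau$ discussed in the captions above can be illustrated with a generic temperature-scaled softmax over similarity scores, as used in InfoNCE-style contrastive losses. This is only a sketch under that assumption, not the paper's label-aware loss; the function name `softmax_weights` and the similarity values are hypothetical.

```python
import numpy as np

def softmax_weights(similarities, tau):
    """Temperature-scaled softmax over an anchor's similarities to candidates."""
    logits = np.asarray(similarities, dtype=float) / tau
    logits = logits - logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Hypothetical cosine similarities between one anchor and three candidates:
# a close one, a moderate one, and a distant ("easy") one.
sims = [0.9, 0.5, -0.2]

for tau in (0.1, 1.0, 10.0):
    print(f"tau={tau}: {softmax_weights(sims, tau).round(3)}")
```

With a small $\tau$, almost all of the weight falls on the most similar candidate, so distant pairs contribute little to the loss. With a large $\tau$, the weights approach uniform, so distant pairs also influence training, which is in line with the behavior described for Figure 5.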