Investigating the Effects of Diffusion-based Conditional Generative Speech Models Used for Speech Enhancement on Dysarthric Speech

Joanna Reszka, Parvaneh Janbakhshi, Tilak Purohit, Sadegh Mohammadi

TL;DR

The paper investigates how diffusion-based conditional speech enhancement models trained on clean, typical speech affect dysarthric speech from speakers with Parkinson's disease (PD) recorded in ideal, noise-free conditions. By evaluating the processed and residue signals from SGMSE, CDiffuSE, and DiffWave with wav2vec2+MLP and OpenSMILE+RF detectors, the study shows that enhancement can remove dysarthric cues and thereby reduce detection performance, while residue signals often retain or reveal such cues and can improve detection when fused with the original speech. The findings indicate that out-of-the-box diffusion models are not yet suitable for dysarthric speech enhancement and that residue information may serve as a valuable complementary feature for pathological speech assessment. This work provides a foundation for developing diffusion models that account for atypical paralinguistic cues and motivates future research on adaptability to diverse speech pathologies and noise conditions.

Abstract

In this study, we explore, for the first time, the effect of pre-trained conditional generative speech models on dysarthric speech due to Parkinson's disease recorded in ideal, non-noisy conditions. We consider one category of generative models, diffusion-based speech enhancement models, which are trained to learn the distribution of clean (i.e., recorded in a noise-free environment) typical speech signals. We therefore hypothesized that, when exposed to dysarthric speech, these models might remove the unseen atypical paralinguistic cues during the enhancement process. Using automatic dysarthric speech detection as the evaluation task, we show experimentally that some acoustic dysarthric speech cues are lost when dysarthric speech recorded in an ideal, non-noisy environment is enhanced. Such pre-trained models are therefore not yet suitable for dysarthric speech enhancement, since they alter pathological speech cues even when they process clean dysarthric speech. Furthermore, we show that the acoustic cues removed by the enhancement models, in the form of a residue speech signal, can provide complementary dysarthric cues when fused with the original input speech signal in the feature space.
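
The two ideas in the abstract, a residue signal and fusion in the feature space, can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not confirm: the residue is taken as the sample-wise difference between the input and the enhanced signal, fusion is plain concatenation of feature vectors, and `toy_features` is a hypothetical stand-in for the wav2vec2 or OpenSMILE features used in the paper.

```python
import numpy as np


def residue_signal(original: np.ndarray, enhanced: np.ndarray) -> np.ndarray:
    """Assumed definition: what the enhancement model removed from the input."""
    # Enhancement models may change the length slightly, so trim to the shorter one.
    n = min(len(original), len(enhanced))
    return original[:n] - enhanced[:n]


def toy_features(signal: np.ndarray) -> np.ndarray:
    """Hypothetical feature extractor (placeholder for wav2vec2 / OpenSMILE)."""
    return np.array([signal.mean(), signal.std(), np.abs(signal).max()])


def fuse_features(original_feats: np.ndarray, residue_feats: np.ndarray) -> np.ndarray:
    """Assumed fusion: concatenate both feature vectors along the last axis."""
    return np.concatenate([original_feats, residue_feats], axis=-1)


# Toy waveforms; in practice `enhanced` would come from a model such as SGMSE.
original = np.array([0.5, -0.2, 0.1, 0.4])
enhanced = np.array([0.4, -0.1, 0.1])

residue = residue_signal(original, enhanced)
fused = fuse_features(toy_features(original), toy_features(residue))
```

The fused vector would then be passed to a detector (an MLP or random forest in the paper's setup) in place of the features from the original signal alone.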

Paper Structure

This paper contains 15 sections, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Schematic representation of speech enhancement models to obtain enhanced and residue signals.
  • Figure 2: Schematic representation of dysarthric (e.g., PD) speech detection.