Unsupervised Rhythm and Voice Conversion of Dysarthric to Healthy Speech for ASR

Karl El Hajal; Enno Hermann; Ajinkya Kulkarni; Mathew Magimai.-Doss

TL;DR

The paper tackles the challenge of ASR on dysarthric speech by introducing an unsupervised Rhythm and Voice (RnV) conversion framework that combines Urhythmic rhythm modeling with kNN-based voice conversion (VC), both operating on self-supervised speech representations. Using WavLM-Large features and a HiFi-GAN vocoder, the method performs any-to-any, zero-shot conversion from dysarthric to typical speech, and the outputs are evaluated with a large ASR model trained on healthy speech (Whisper). Results show that rhythm conversion improves WER, particularly for severe dysarthria. Voice conversion aids alignment but does not consistently surpass the rhythm-only gains, and combining rhythm conversion with VC yields variable benefits. The approach requires minimal labeled data and no speaker-specific fine-tuning, which enables practical zero-shot adaptation and offers a pathway toward improved assistive ASR and dysarthria analysis.
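To make the kNN-based voice conversion step concrete, here is a minimal sketch of frame-wise nearest-neighbour matching over speech features. It is an illustration under stated assumptions, not the authors' implementation: the function name `knn_convert` is hypothetical, `k=4` is an assumed setting, cosine similarity is an assumed distance measure, and random arrays stand in for real WavLM-Large features (1024 dimensions per frame).

```python
import numpy as np


def knn_convert(source_feats, target_feats, k=4):
    """Replace each source frame with the mean of its k nearest target frames.

    Hypothetical helper: nearness is measured with cosine similarity,
    and k=4 is an assumed default, not a value taken from the paper.
    """
    # Normalise rows so that a dot product equals cosine similarity.
    src = source_feats / np.linalg.norm(source_feats, axis=1, keepdims=True)
    tgt = target_feats / np.linalg.norm(target_feats, axis=1, keepdims=True)
    sim = src @ tgt.T  # shape: (n_source_frames, n_target_frames)

    # Indices of the k most similar target frames for every source frame.
    nearest = np.argpartition(-sim, k - 1, axis=1)[:, :k]

    # Average the selected target frames (unnormalised) per source frame.
    return target_feats[nearest].mean(axis=1)


# Random stand-ins for real features (assumption: 1024-dim frames).
rng = np.random.default_rng(0)
source = rng.normal(size=(50, 1024))   # e.g. dysarthric utterance frames
target = rng.normal(size=(200, 1024))  # e.g. pool of typical-speech frames
converted = knn_convert(source, target)
```

In a full pipeline the converted features would be passed to a vocoder (HiFi-GAN in the paper) to synthesise audio; that stage is omitted here.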

Abstract

Automatic speech recognition (ASR) systems are well known to perform poorly on dysarthric speech. Previous works have addressed this by speaking rate modification to reduce the mismatch with typical speech. Unfortunately, these approaches rely on transcribed speech data to estimate speaking rates and phoneme durations, which might not be available for unseen speakers. Therefore, we combine unsupervised rhythm and voice conversion methods based on self-supervised speech representations to map dysarthric to typical speech. We evaluate the outputs with a large ASR model pre-trained on healthy speech without further fine-tuning and find that the proposed rhythm conversion especially improves performance for speakers of the Torgo corpus with more severe cases of dysarthria. Code and audio samples are available at https://idiap.github.io/RnV.
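The ASR evaluation is reported as word error rate (WER). As a reminder of how that metric is defined, the sketch below computes WER as word-level edit distance divided by the reference length. The function name `word_error_rate` is hypothetical, the reference sentence is the one shown in Figure 2 (lowercased), and the hypothesis string is made up for illustration; it is not an actual model output.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count.

    Hypothetical helper; assumes a non-empty reference and
    whitespace-tokenised, already-normalised text.
    """
    ref, hyp = reference.split(), hypothesis.split()

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                          # deletion
                curr[j - 1] + 1,                      # insertion
                prev[j - 1] + (ref_word != hyp_word)  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Made-up hypothesis: one deleted word ("a") and one substitution.
wer = word_error_rate("carl lives in a lively home",
                      "carl lives in lovely home")
```

Here the hypothesis differs from the six-word reference by two edits, so `wer` evaluates to 2/6.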

Paper Structure (9 sections, 4 figures, 1 table)

Introduction
Methods
Experimental Setup
Framework implementation
Datasets
Rhythm Analysis
ASR Evaluation
Results
Discussion and conclusions

Figures (4)

Figure 1: Overview of the unsupervised Rhythm and Voice conversion framework.
Figure 2: Segmented waveform of speaker M02 pronouncing the sentence “Carl lives in a lively home”. Ground truth phonemic transcriptions are shown at the bottom for reference ('noi' corresponds to noise).
Figure 3: Visualization of computed rhythm models: (a) Global speaking rates for each Torgo speaker, categorized by severity. (b-d) Comparison of gamma duration distributions per speech type for control speaker MC01 and dysarthric speaker M01.
Figure 4: WER results on Torgo, grouped by severity level, presented for each experimental configuration.
