Voice Cloning for Dysarthric Speech Synthesis: Addressing Data Scarcity in Speech-Language Pathology

Birger Moell, Fredrik Sand Aronsson

TL;DR

This work addresses data scarcity and privacy in dysarthria research by generating synthetic speech via voice cloning on the TORGO dataset. The ElevenLabs platform produced gender-matched synthetic voices for all speakers, and a licensed SLP evaluated a subset of samples for dysarthria, speaker gender, and signs of synthesis. The SLP achieved 100% accuracy for dysarthria detection, 95% for gender, and 70% overall accuracy in distinguishing synthetic from real speech, with 30% of synthetic samples initially misidentified as real, which suggests high realism. The work highlights the potential of synthetic data to support diagnostics, rehabilitation, and AI-driven tools in speech-language pathology, and the authors publicly release the synthetic dataset to foster collaboration.

Abstract

This study explores voice cloning to generate synthetic speech that replicates the unique patterns of individuals with dysarthria. Using the TORGO dataset, we address data scarcity and privacy challenges in speech-language pathology. Our contributions include demonstrating that voice cloning preserves dysarthric speech characteristics, analyzing differences between real and synthetic data, and discussing implications for diagnostics, rehabilitation, and communication. We cloned voices from dysarthric and control speakers using a commercial platform, ensuring gender-matched synthetic voices. A licensed speech-language pathologist (SLP) evaluated a subset for dysarthria, speaker gender, and synthetic indicators. The SLP correctly identified dysarthria in all cases and speaker gender in 95% of cases, but misclassified 30% of synthetic samples as real, indicating high realism. Our results suggest that synthetic speech effectively captures disordered characteristics and that voice cloning has advanced enough to produce high-quality data resembling real speech, even to trained professionals. This has critical implications for healthcare, where synthetic data can mitigate data scarcity, protect privacy, and enhance AI-driven diagnostics. By enabling the creation of diverse, high-quality speech datasets, voice cloning can support more generalizable models, personalized therapy, and better assistive technologies for dysarthria. We publicly release our synthetic dataset to foster further research and collaboration, aiming to develop robust models that improve patient outcomes in speech-language pathology.
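The evaluation figures above rest on simple ratios, but the denominator matters: an overall real-versus-synthetic accuracy is computed over all rated samples, while the share of synthetic samples judged real is computed over synthetic samples only. The following is a minimal sketch of both ratios, not the authors' scoring procedure. The `Rating` schema, the function names, and the toy ratings are all assumptions made up for illustration and are not data from the study.

```python
from dataclasses import dataclass


@dataclass
class Rating:
    """One listener judgment for one audio sample (hypothetical schema)."""
    is_synthetic: bool      # ground truth: was the sample voice-cloned?
    rated_synthetic: bool   # the listener's judgment


def overall_accuracy(ratings):
    """Share of all samples whose real/synthetic label was judged correctly."""
    return sum(r.is_synthetic == r.rated_synthetic for r in ratings) / len(ratings)


def synthetic_judged_real_rate(ratings):
    """Share of synthetic samples that the listener took for real speech."""
    synthetic = [r for r in ratings if r.is_synthetic]
    return sum(not r.rated_synthetic for r in synthetic) / len(synthetic)


# Made-up toy ratings, NOT data from the study.
toy = [
    Rating(is_synthetic=True, rated_synthetic=True),
    Rating(is_synthetic=True, rated_synthetic=False),   # synthetic judged real
    Rating(is_synthetic=True, rated_synthetic=True),
    Rating(is_synthetic=False, rated_synthetic=False),
]

print(overall_accuracy(toy))            # 3 of 4 judgments correct
print(synthetic_judged_real_rate(toy))  # 1 of 3 synthetic samples judged real
```

In this toy set the two ratios differ because they use different denominators, which is worth keeping in mind when comparing an overall accuracy with a misclassification rate for synthetic samples.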

Paper Structure

This paper contains 20 sections, 1 figure, 2 tables.

Figures (1)

  • Figure 1: Difference Between Uncertainty and Ground Truth for Samples 1-20