Prototype-Based Disentanglement for Controllable Dysarthric Speech Synthesis

Haoshen Wang, Xueli Zhong, Bingbing Lin, Jia Huang, Xingduo Pan, Shengxiang Liang, Nizhuan Wang, Wai Ting Siok

TL;DR

This work tackles the variability and data scarcity of dysarthric speech by introducing ProtoDisent-TTS, a prototype-based disentanglement TTS framework that separates speaker timbre from pathological articulation within a unified latent space. A pathology prototype codebook provides interpretable articulation controls, and a gradient reversal–based adversarial setup makes speaker embeddings invariant to pathology. Together they enable bidirectional healthy/dysarthric transformation via a joint representation z = s + p_k, where s is the speaker embedding and p_k is the selected pathology prototype, fed into a pre-trained Index-TTS backbone. The method supports dysarthric-to-healthy reconstruction as well as scalable ASR data augmentation through healthy-to-dysarthric synthesis, and experiments on the TORGO dataset show improved ASR performance and strong speaker identity preservation. The approach offers a practical path toward better dysarthric speech recognition and assistive synthesis through controllable, pathology-informed speech generation.
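The joint representation z = s + p_k can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the codebook keys, the four-dimensional vectors, their values, and the helper name `compose_condition` are hypothetical and not taken from the paper, where the prototypes would be learned and the result passed to the TTS backbone.

```python
from typing import Dict, List

Vector = List[float]

# Hypothetical pathology prototype codebook: one vector p_k per articulation
# pattern. Keys, dimensionality, and values are illustrative assumptions.
PROTOTYPES: Dict[str, Vector] = {
    "healthy": [0.25, 0.0, -0.5, 0.0],
    "dysarthric": [0.5, -0.25, 1.0, 0.75],
}


def compose_condition(speaker: Vector, pattern: str) -> Vector:
    """Return z = s + p_k for speaker embedding s and the prototype named `pattern`."""
    prototype = PROTOTYPES[pattern]
    if len(speaker) != len(prototype):
        raise ValueError("speaker embedding and prototype must share a dimension")
    # Element-wise sum: timbre (s) and articulation (p_k) stay separate inputs
    # until this point, so either one can be swapped independently.
    return [s_i + p_i for s_i, p_i in zip(speaker, prototype)]


# Same (made-up) speaker embedding, two different articulation prototypes.
speaker_embedding = [1.0, 2.0, -1.0, 0.5]
z_healthy = compose_condition(speaker_embedding, "healthy")
z_dysarthric = compose_condition(speaker_embedding, "dysarthric")
```

Because only the prototype changes between the two calls, the difference between `z_dysarthric` and `z_healthy` does not depend on the speaker embedding, which is the property that makes switching between healthy and dysarthric output a matter of choosing k.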

Abstract

Dysarthric speech exhibits high variability and limited labeled data, posing major challenges for both automatic speech recognition (ASR) and assistive speech technologies. Existing approaches rely on synthetic data augmentation or speech reconstruction, yet often entangle speaker identity with pathological articulation, limiting controllability and robustness. In this paper, we propose ProtoDisent-TTS, a prototype-based disentanglement TTS framework built on a pre-trained text-to-speech backbone that factorizes speaker timbre and dysarthric articulation within a unified latent space. A pathology prototype codebook provides interpretable and controllable representations of healthy and dysarthric speech patterns, while a dual-classifier objective with a gradient reversal layer enforces invariance of speaker embeddings to pathological attributes. Experiments on the TORGO dataset demonstrate that this design enables bidirectional transformation between healthy and dysarthric speech, leading to consistent ASR performance gains and robust, speaker-aware speech reconstruction.
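The gradient reversal layer mentioned above has a simple generic behaviour that can be sketched without a deep learning framework: it is the identity in the forward pass and multiplies the incoming gradient by a negative factor in the backward pass. The class below is a hand-rolled toy, not the authors' implementation; the class name, the scale `lam`, and the example numbers are assumptions, and where exactly the layer sits in ProtoDisent-TTS is not specified in this abstract.

```python
from typing import List


class GradientReversal:
    """Toy gradient reversal layer: identity forward, gradient times -lam backward."""

    def __init__(self, lam: float = 1.0) -> None:
        self.lam = lam  # reversal strength (hypothetical hyperparameter)

    def forward(self, x: List[float]) -> List[float]:
        # The downstream classifier sees the embedding unchanged.
        return list(x)

    def backward(self, grad_output: List[float]) -> List[float]:
        # The upstream encoder receives the negated, scaled gradient, so a step
        # that lowers the classifier's loss raises it from the encoder's side.
        return [-self.lam * g for g in grad_output]


grl = GradientReversal(lam=0.5)

# Made-up speaker embedding passed to a (not shown) pathology classifier.
speaker_embedding = [1.0, 2.0, -1.0]
classifier_input = grl.forward(speaker_embedding)

# Made-up gradient of the pathology classification loss w.r.t. that input.
grad_from_classifier = [0.25, -0.5, 1.0]
grad_to_encoder = grl.backward(grad_from_classifier)
```

In a setup like the one described, this sign flip is what pushes the speaker embedding toward carrying no pathology information, while a second classifier without reversal can keep it informative about speaker identity.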

Paper Structure (13 sections, 3 equations, 2 figures, 3 tables)