
Creating Personalized Synthetic Voices from Articulation Impaired Speech Using Augmented Reconstruction Loss

Yusheng Tian, Jingyu Li, Tan Lee

TL;DR

This work addresses personalized TTS for articulation-impaired speech from tongue-cancer patients by introducing an augmented reconstruction loss that combines a standard reconstruction term with regularization and consistency penalties derived from a separately trained phone classifier. The total objective $L_{total}=L_{rec}+\beta L_{reg}+\gamma L_{consis}$ simultaneously mitigates distortions from articulation impairment and enforces correct articulation via a frame-level articulation score and a KL-based consistency term. Experiments on a real Cantonese speaker demonstrate that the proposed loss yields intelligible speech with articulation quality comparable to unimpaired speech and VC-based baselines, while better preserving the target speaker’s identity. The approach shows practical potential for post-surgical articulation rehabilitation and could extend to other articulation disorders across languages. Key formulas include the impairment-aware regularization $L_{reg}$ driven by $\alpha_t \propto \exp(-\lambda p^*_t)$ and the consistency loss $L_{consis}=D_{KL}(P_{tts}\Vert P_{cls})$, embedded within the overall training objective.
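
To make the structure of the objective concrete, here is a minimal sketch of how the three terms could be combined. It is not the authors' implementation. The summary above gives only the proportionality $\alpha_t \propto \exp(-\lambda p^*_t)$ and the form $D_{KL}(P_{tts}\Vert P_{cls})$, so several details are assumptions made for illustration: the weights are normalized to sum to one, the KL term is averaged over frames, $L_{rec}$ and $L_{reg}$ are passed in as precomputed scalars, and all function names and default values are hypothetical.

```python
import math


def impairment_weights(p_star, lam=1.0):
    """Frame weights alpha_t proportional to exp(-lam * p*_t).

    p_star: per-frame articulation scores in [0, 1].
    Normalizing the weights to sum to 1 is an assumption of this sketch.
    """
    raw = [math.exp(-lam * p) for p in p_star]
    total = sum(raw)
    return [r / total for r in raw]


def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions over the same phone set."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def consistency_loss(p_tts, p_cls):
    """Frame-averaged D_KL(P_tts || P_cls).

    p_tts, p_cls: lists of per-frame phone distributions.
    Averaging over frames is an assumption of this sketch.
    """
    per_frame = [kl_divergence(p, q) for p, q in zip(p_tts, p_cls)]
    return sum(per_frame) / len(per_frame)


def total_loss(l_rec, l_reg, l_consis, beta=1.0, gamma=1.0):
    """L_total = L_rec + beta * L_reg + gamma * L_consis."""
    return l_rec + beta * l_reg + gamma * l_consis


# Illustrative values only: two frames, three phone classes.
alphas = impairment_weights([0.9, 0.2], lam=2.0)
l_consis = consistency_loss(
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
    [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]],
)
loss = total_loss(l_rec=1.0, l_reg=0.5, l_consis=l_consis, beta=0.1, gamma=0.1)
```

Under this weighting, a frame with a low articulation score $p^*_t$ receives a larger $\alpha_t$ than a well-articulated frame, which matches the stated proportionality.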

Abstract

This research concerns the creation of personalized synthetic voices for head and neck cancer survivors, focusing on tongue cancer patients whose speech may exhibit severe articulation impairment. Our goal is to restore normal articulation in the synthesized speech while maximally preserving the target speaker's individuality in terms of both voice timbre and speaking style. This is formulated as a task of learning from noisy labels. We propose to augment the commonly used speech reconstruction loss with two additional terms. The first term is a regularization loss that mitigates the impact of distorted articulation in the training speech. The second term is a consistency loss that encourages correct articulation in the generated speech. These additional loss terms are obtained from frame-level articulation scores of the original and generated speech, which are derived using a separately trained phone classifier. Experimental results on a real case of a tongue cancer patient confirm that the synthetic voice achieves articulation quality comparable to unimpaired natural speech, while effectively maintaining the target speaker's individuality. Audio samples are available at https://myspeechproject.github.io/ArticulationRepair/.

Paper Structure (13 sections, 5 equations, 2 figures, 2 tables)