VoiceX: A Text-To-Speech Framework for Custom Voices
Silvan Mertes, Daksitha Withanage Don, Otto Grothe, Johanna Kuch, Ruben Schlagowski, Elisabeth André
TL;DR
VoiceX tackles the challenge of making neural TTS voices customizable by non-experts. It introduces a human-in-the-loop Evolution Strategy that navigates the latent speaker-embedding space of a VITS-based TTS model, operating on a PCA-reduced subspace to efficiently reach user-specified voice timbres. The framework includes a web interface and a public Python API for deploying created voices. A user study (N=65) reports high perceived voice quality (MOS $= 4.32$) and strong subjective usability (score $= 4.57$), with nuanced findings on personality alignment. The work demonstrates a practical, accessible route to personalized TTS voices for applications that require individualized synthetic voices.
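The search described above can be illustrated with a minimal sketch. It is not the VoiceX implementation: the embedding size (256), the number of PCA components (8), the population settings, and the function names `fit_pca`, `to_embedding`, `evolve`, and `simulated_rating` are all assumptions made for illustration. A simulated rating function stands in for the human listener, and random vectors stand in for real speaker embeddings.

```python
import numpy as np

def fit_pca(embeddings, n_components):
    """Fit PCA with an SVD and return (mean, principal components)."""
    mean = embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(embeddings - mean, full_matrices=False)
    return mean, vt[:n_components]

def to_embedding(z, mean, components):
    """Map a point in the reduced subspace back to the full embedding space."""
    return mean + z @ components

def evolve(parent, rate_fn, rng, generations=20, offspring=8, sigma=0.5):
    """Elitist evolution strategy: mutate the current best candidate and
    keep an offspring only if it is rated higher than the current best."""
    best, best_score = parent, rate_fn(parent)
    for _ in range(generations):
        candidates = best + sigma * rng.standard_normal((offspring, best.size))
        scores = np.array([rate_fn(c) for c in candidates])
        if scores.max() > best_score:
            best, best_score = candidates[scores.argmax()], scores.max()
    return best, best_score

rng = np.random.default_rng(0)

# Placeholder data: in a real system these would be speaker embeddings
# taken from the TTS model (assumed here: 200 speakers, 256 dimensions).
speaker_embeddings = rng.standard_normal((200, 256))
mean, components = fit_pca(speaker_embeddings, n_components=8)

# Stand-in for the human in the loop: the "listener" prefers candidates
# close to a hidden target timbre in the reduced subspace.
target = rng.standard_normal(8)

def simulated_rating(z):
    return -np.linalg.norm(z - target)

start = np.zeros(8)
best, score = evolve(start, simulated_rating, rng)

# Full-size embedding that would be passed to the TTS model.
voice_embedding = to_embedding(best, mean, components)
```

In an interactive setting, `simulated_rating` would be replaced by the user's selection among synthesized candidates. Searching in the reduced subspace keeps the number of dimensions to explore small, which matters when every evaluation costs a human listening pass.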
Abstract
Modern text-to-speech (TTS) systems can create highly realistic and natural-sounding speech. Despite these developments, customizing TTS voices remains a complex task that mostly requires the expertise of specialists in the field. One reason for this is the use of deep learning models, whose large, non-interpretable parameter spaces restrict the feasibility of manual customization. In this paper, we present a novel human-in-the-loop paradigm based on an evolutionary algorithm for directly interacting with the parameter space of a neural TTS model. We integrated our approach into a user-friendly graphical user interface that allows users to efficiently create original voices. These voices can then be used with the backbone TTS model, for which we provide a Python API. Further, we present the results of a user study exploring the capabilities of VoiceX. We show that VoiceX is an appropriate tool for creating individual, custom voices.
