Personalized Voice Synthesis through Human-in-the-Loop Coordinate Descent
Yusheng Tian, Junbin Liu, Tan Lee
TL;DR
This work addresses personalized voice synthesis without target recordings by coupling a speech resynthesis framework with a human-in-the-loop, coordinate-descent search over PCA-derived speaker-embedding coefficients. The embedding space is shown to be interpretable, with principal directions largely aligned to perceptual attributes such as pitch and timbre, enabling intuitive user-driven refinements. In computer simulations and a user study, the method achieves high similarity to target voices for in-domain data, and the results expose limitations when targeting out-of-domain voices, which are attributed to training data constraints. The approach offers a practical pathway for voiceless individuals to regain a recognizable voice and suggests future extensions to initialization strategies and multilingual settings.
Abstract
This paper describes a human-in-the-loop approach to personalized voice synthesis in the absence of reference speech data from the target speaker. It is intended to help vocally disabled individuals restore their lost voices without requiring any prior recordings. The proposed approach leverages a learned speaker embedding space. Starting from an initial voice, users iteratively refine the speaker embedding parameters through a coordinate descent-like process, guided by auditory perception. Analysis of the latent space shows that the embedding parameters correspond to perceptual voice attributes, including pitch, vocal tension, brightness, and nasality, which makes the search process intuitive. Computer simulations and real-world user studies demonstrate that the proposed approach is effective in approximating target voices across a diverse range of test cases.
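The search described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (fit_pca, coordinate_search), the fixed grid of candidate values per coordinate, the choice of the mean voice as the starting point, and the simulated listener that picks the candidate closest to a known target are all assumptions made for illustration. In a real session, the choose callback would synthesize each candidate voice and return the index the user prefers.

```python
import numpy as np


def fit_pca(embeddings, n_components):
    """Return the mean, top principal directions, and per-direction std dev."""
    mean = embeddings.mean(axis=0)
    # Rows of vt are orthonormal principal directions, by decreasing variance.
    _, s, vt = np.linalg.svd(embeddings - mean, full_matrices=False)
    scale = s[:n_components] / np.sqrt(len(embeddings) - 1)
    return mean, vt[:n_components], scale


def coordinate_search(choose, mean, components, scale,
                      n_sweeps=3, n_candidates=7, span=3.0):
    """Coordinate-descent-like search over PCA coefficients.

    choose(candidates) stands in for the listener (hypothetical interface):
    it receives a list of candidate embeddings and returns the index of
    the one judged closest to the desired voice.
    """
    coeffs = np.zeros(len(components))  # assumed start: the mean voice
    for _ in range(n_sweeps):
        for k in range(len(components)):
            # Vary only coefficient k, in units of its standard deviation.
            grid = np.linspace(-span, span, n_candidates) * scale[k]
            candidates = []
            for value in grid:
                trial = coeffs.copy()
                trial[k] = value
                candidates.append(mean + trial @ components)
            coeffs[k] = grid[choose(candidates)]
    return mean + coeffs @ components


# Usage with synthetic data and a simulated listener (both assumptions).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 16))  # stand-in for learned embeddings
mean, components, scale = fit_pca(embeddings, n_components=4)

# A target whose coefficients lie on the search grid.
target = mean + (np.array([1.0, -2.0, 0.0, 3.0]) * scale) @ components


def simulated_listener(candidates):
    """Pick the candidate with the smallest Euclidean distance to the target."""
    return int(np.argmin([np.linalg.norm(c - target) for c in candidates]))


result = coordinate_search(simulated_listener, mean, components, scale)
print(np.linalg.norm(result - target))
```

Because the principal directions are orthonormal, each coordinate can be tuned independently in this toy setting. A human listener is noisier than the simulated one, which is one reason repeated sweeps over the coordinates may be useful in practice.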
