Personalized Voice Synthesis through Human-in-the-Loop Coordinate Descent

Yusheng Tian, Junbin Liu, Tan Lee

TL;DR

This work addresses personalized voice synthesis without target recordings by coupling a speech resynthesis framework with a human-in-the-loop, coordinate-descent search over PCA-derived speaker-embedding coefficients. The embedding space is shown to be interpretable, with principal directions largely aligned to perceptual attributes like pitch and timbre, enabling intuitive user-driven refinements. Through computer simulations and a user study, the method demonstrates high similarity to target voices for in-domain data and provides insights into limitations when targeting out-of-domain voices due to training data constraints. The approach offers a practical pathway for voiceless individuals to regain a recognizable voice and suggests future extensions to initialization strategies and multilingual settings.

Abstract

This paper describes a human-in-the-loop approach to personalized voice synthesis in the absence of reference speech data from the target speaker. It is intended to help vocally disabled individuals restore their lost voices without requiring any prior recordings. The proposed approach leverages a learned speaker embedding space. Starting from an initial voice, users iteratively refine the speaker embedding parameters through a coordinate descent-like process, guided by auditory perception. An analysis of the latent space shows that the embedding parameters correspond to perceptual voice attributes, including pitch, vocal tension, brightness, and nasality, which makes the search process intuitive. Computer simulations and real-world user studies demonstrate that the proposed approach is effective in approximating target voices across a diverse range of test cases.
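The search described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the synthetic embeddings, the choice of the mean voice as the starting point, the candidate grid (`n_candidates`, `span`), and the simulated listener that picks the candidate closest to a hidden target in Euclidean distance are all assumptions made for illustration. In the actual system a human listener would make each choice by ear.

```python
import numpy as np

def fit_pca(embeddings, n_components):
    """Return the mean, principal directions (as rows), and per-direction std."""
    mean = embeddings.mean(axis=0)
    centered = embeddings - mean
    # Rows of vt are orthonormal principal directions of the centered data.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    std = s[:n_components] / np.sqrt(len(embeddings) - 1)
    return mean, vt[:n_components], std

def coordinate_search(choose, mean, directions, std,
                      n_sweeps=2, n_candidates=7, span=3.0):
    """Coordinate-descent-like search: vary one PCA coefficient per query.

    `choose(candidates)` stands in for the listener. It receives an array of
    candidate embeddings and returns the index of the preferred one.
    """
    coeffs = np.zeros(len(directions))            # assumed start: mean voice
    offsets = np.linspace(-span, span, n_candidates)
    for _ in range(n_sweeps):
        for k in range(len(directions)):
            # Candidates differ only in coefficient k; the others stay fixed.
            trial = np.tile(coeffs, (n_candidates, 1))
            trial[:, k] = offsets * std[k]
            candidates = mean + trial @ directions
            coeffs[k] = trial[choose(candidates), k]
    return mean + coeffs @ directions, coeffs

def simulated_listener(target):
    """Hypothetical stand-in for a user: prefers the candidate nearest the target."""
    def choose(candidates):
        return int(np.argmin(np.linalg.norm(candidates - target, axis=1)))
    return choose

# Synthetic "speaker embeddings" with a different spread per dimension.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 16)) * np.linspace(3.0, 0.5, 16)
mean, directions, std = fit_pca(embeddings, n_components=4)

# Hidden target voice, placed on the candidate grid for this demonstration.
target_coeffs = np.array([1.0, -2.0, 0.0, 3.0]) * std
target = mean + target_coeffs @ directions

found, coeffs = coordinate_search(simulated_listener(target), mean, directions, std)
print(np.linalg.norm(found - target))
```

Because the principal directions are orthonormal, the Euclidean distance used by the simulated listener separates across coefficients, so each coordinate can be tuned independently. A human listener's preference need not separate this way, which is one reason repeated sweeps over the coordinates may help in practice.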

Paper Structure (12 sections, 4 equations, 6 figures, 2 tables, 1 algorithm)

Figures (6)

  • Figure 1: Overview of the proposed system. Per-utterance pitch normalization is applied to encode both the timbre and overall pitch information into the speaker embedding.
  • Figure 2: Illustration of the search process. Left: the user interface for a single query. Right: an example search sequence within a 2-dimensional parameter space.
  • Figure 3: A real user session for the LibriTTS-R "easy" target. Left: UMAP-projected speaker embeddings, extracted using Resemblyzer. Right: Surrogate objective function value of user-selected voice candidates after each query.
  • Figure 4: Visualization of how the generated mel-spectrogram changes when the speaker embedding is shifted along the five principal voice editing directions. The selected speech sample is 6345_93302_000037_000003.wav from the LibriTTS-R dev-clean set, a female voice speaking "The setting of the scene seemed to her all important".
  • Figure 5: Listener responses on perceived changes in voice attributes when the speaker embedding is manipulated along a single editing direction.
  • ...and 1 more figures