
Phoneme Hallucinator: One-shot Voice Conversion via Set Expansion

Siyuan Shan, Yang Li, Amartya Banerjee, Junier B. Oliva

TL;DR

Phoneme Hallucinator addresses the trade-off between intelligibility and speaker similarity in one-shot voice conversion by introducing a conditional set generative model that hallucinates a diverse yet faithful expansion of the target speaker's phoneme representations. The method combines a de Finetti-based factorization with Set Transformer-driven prior and posterior networks and a conditional VAE for per-element generation, so that arbitrarily many target phoneme samples can be drawn conditioned on a small target set. The expanded target set is then used in a neighbor-based VC pipeline with a pretrained WavLM encoder and a HiFi-GAN vocoder. On LibriSpeech, this yields state-of-the-art results on a challenging one-shot VC task under both objective and subjective metrics, along with preliminary cross-lingual capability. The work offers practical improvements for real-world one-shot VC and outlines future directions, including adapting the vocoder to hallucinated representations and exploring conditional diffusion models as an alternative.
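
For orientation, the factorization mentioned above can be written out as a sketch. The first line is the standard de Finetti representation of an infinitely exchangeable sequence. The second line is only an assumed conditional analogue for set expansion: the symbols $z$, $X_{\text{given}}$ and $X_{\text{new}}$ are illustrative notation, not necessarily the paper's, and the paper's own equations may differ in detail.

```latex
% De Finetti representation: exchangeable elements are conditionally
% i.i.d. given a latent variable z.
p(x_1, \dots, x_n) = \int p(z) \prod_{i=1}^{n} p(x_i \mid z) \, dz

% Assumed conditional analogue for set expansion (illustrative notation):
% new elements are drawn independently given z and the small given set.
p(X_{\text{new}} \mid X_{\text{given}})
  = \int p(z \mid X_{\text{given}})
    \prod_{x \in X_{\text{new}}} p(x \mid z, X_{\text{given}}) \, dz
```

Under this reading, a permutation-invariant network would parameterize the distribution over $z$, and a conditional VAE would model each per-element term, which is why the number of hallucinated samples is not bounded by the size of the given set.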

Abstract

Voice conversion (VC) aims at altering a person's voice to make it sound similar to the voice of another person while preserving linguistic content. Existing methods suffer from a dilemma between content intelligibility and speaker similarity; i.e., methods with higher intelligibility usually have a lower speaker similarity, while methods with higher speaker similarity usually require plenty of target speaker voice data to achieve high intelligibility. In this work, we propose a novel method, Phoneme Hallucinator, that achieves the best of both worlds. Phoneme Hallucinator is a one-shot VC model; it adopts a novel model to hallucinate diversified and high-fidelity target speaker phonemes based just on a short target speaker voice (e.g., 3 seconds). The hallucinated phonemes are then exploited to perform neighbor-based voice conversion. Our model is a text-free, any-to-any VC model that requires no text annotations and supports conversion to any unseen speaker. Objective and subjective evaluations show that Phoneme Hallucinator outperforms existing VC methods for both intelligibility and speaker similarity.

Paper Structure (23 sections, 6 equations, 4 figures, 3 tables)

This paper contains 23 sections, 6 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Comparison between kNN-VC (the prior state-of-the-art method) and our proposed Phoneme Hallucinator on the LibriSpeech test-clean split with varying target voice duration. Word Error Rate (WER$\downarrow$) and Speaker Similarity ($\uparrow$) respectively measure intelligibility and speaker similarity. Shaded areas indicate standard deviations of WER and Speaker Similarity computed on the test-clean split.
  • Figure 2: The VC pipeline of our method. A pre-trained WavLM model (Chen et al., 2022) extracts the source representation sequence (green) from the source voice and the target representation set (yellow) from the target voice. Then, the target set is expanded by our hallucinator. Afterward, every source representation is replaced by its neighbors in the expanded target set, resulting in the converted sequence (pink). Finally, a pre-trained vocoder transforms the converted sequence to voice.
  • Figure 3: The detailed structure of the hallucinator. Posterior Permutation Invariant Nets, Prior Permutation Invariant Nets, and Permutation Equivariant Nets are all implemented by Set Transformer. Conditional Encoder and Conditional Decoder are implemented by multilayer perceptrons (MLP).
  • Figure 4: t-SNE visualization of three randomly chosen expanded sets. In each subplot, the 100 given target speech representations (red), equivalent to 2 seconds of speech, are extracted from an utterance by a speaker in the LibriSpeech test-clean split. Conditioned on the given representations, our model hallucinates 2,000 new representations (blue).
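
The neighbor-replacement step described in the Figure 2 caption can be illustrated with a minimal NumPy sketch. Several details here are assumptions rather than the authors' implementation: the function name `knn_convert` is hypothetical, cosine similarity is assumed as the matching criterion, `k=4` with simple averaging is an assumed setting, the feature dimension of 1024 is assumed, and random arrays stand in for real WavLM features and hallucinated samples.

```python
import numpy as np


def knn_convert(source, target_pool, k=4):
    """Replace each source frame by the mean of its k nearest target frames.

    Hypothetical helper: cosine similarity and averaging are assumptions.
    source:      (num_source_frames, dim) array
    target_pool: (num_target_frames, dim) array (given + hallucinated frames)
    """
    # Normalize rows so that a dot product equals cosine similarity.
    src = source / np.linalg.norm(source, axis=1, keepdims=True)
    tgt = target_pool / np.linalg.norm(target_pool, axis=1, keepdims=True)
    sim = src @ tgt.T  # shape: (num_source_frames, num_target_frames)
    # Indices of the k most similar target frames for every source frame.
    nearest = np.argpartition(-sim, k - 1, axis=1)[:, :k]
    # Average the selected (unnormalized) target frames.
    return target_pool[nearest].mean(axis=1)


# Placeholder data standing in for real features (all sizes are assumptions).
rng = np.random.default_rng(0)
source_seq = rng.normal(size=(150, 1024))     # source representation sequence
given_targets = rng.normal(size=(100, 1024))  # small given target set
hallucinated = rng.normal(size=(2000, 1024))  # stand-in for hallucinator output

# Expanded target set = given representations plus hallucinated ones.
expanded = np.concatenate([given_targets, hallucinated], axis=0)
converted = knn_convert(source_seq, expanded)
```

In this sketch the converted sequence has the same length as the source sequence, so a vocoder could consume it frame by frame. Expanding the pool from 100 to 2,100 frames simply gives each source frame more candidates to match against.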