SelfVC: Voice Conversion With Iterative Refinement using Self Transformations

Paarth Neekhara, Shehzeen Hussain, Rafael Valle, Boris Ginsburg, Rishabh Ranjan, Shlomo Dubnov, Farinaz Koushanfar, Julian McAuley

TL;DR

The paper addresses zero-shot voice conversion without text transcriptions by training on imperfectly disentangled, SSL-derived speech representations. It introduces SelfVC, which combines a Conformer-SSL content encoder, a speaker verification model, and a pitch- and duration-aware synthesizer, and trains with a self-transformation loop in which the model's own voice-converted outputs serve as challenging augmented inputs. Key contributions are a duration-augmented content representation, a self-transformation strategy that improves speaker similarity, and state-of-the-art zero-shot and cross-lingual voice conversion results in a text-free setting. The approach generalizes well to unseen and low-resource languages, which suggests practical potential for multilingual VC without transcriptions or phonetic posteriors.

Abstract

We propose SelfVC, a training strategy to iteratively improve a voice conversion model with self-synthesized examples. Previous efforts on voice conversion focus on factorizing speech into explicitly disentangled representations that separately encode speaker characteristics and linguistic content. However, disentangling speech representations to capture such attributes using task-specific loss terms can lead to information loss. In this work, instead of explicitly disentangling attributes with loss terms, we present a framework to train a controllable voice conversion model on entangled speech representations derived from self-supervised learning (SSL) and speaker verification models. First, we develop techniques to derive prosodic information from the audio signal and SSL representations to train predictive submodules in the synthesis model. Next, we propose a training strategy to iteratively improve the synthesis model for voice conversion, by creating a challenging training objective using self-synthesized examples. We demonstrate that incorporating such self-synthesized examples during training improves the speaker similarity of generated speech as compared to a baseline voice conversion model trained solely on heuristically perturbed inputs. Our framework is trained without any text and achieves state-of-the-art results in zero-shot voice conversion on metrics evaluating naturalness, speaker similarity, and intelligibility of synthesized audio.
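
The training objective described above (and in the Figure 1 caption) can be illustrated with a toy sketch. It is not the authors' implementation: the linear `ToySynthesizer`, the random-projection encoders, the additive-noise heuristic transform, the mean-squared-error loss, and the `p_self` mixing probability are all assumptions made for illustration, and utterances are represented directly as mel-spectrogram arrays so that no vocoder is needed. What it does follow from the text is the data flow: content features come from a transformed version of the utterance, the speaker embedding comes from the original utterance, and the synthesizer is asked to reconstruct the original mel-spectrogram.

```python
import numpy as np

N_MEL, N_CONTENT, N_SPK = 8, 6, 4  # toy feature sizes, chosen arbitrarily

_proj_rng = np.random.default_rng(0)
W_CONTENT = _proj_rng.normal(size=(N_MEL, N_CONTENT))  # frozen toy "SSL" encoder
W_SPEAKER = _proj_rng.normal(size=(N_MEL, N_SPK))      # frozen toy speaker encoder


def extract_content(mel):
    """Toy content encoder: per-frame projection, (T, N_MEL) -> (T, N_CONTENT)."""
    return mel @ W_CONTENT


def extract_speaker(mel):
    """Toy speaker encoder: time-averaged projection, (T, N_MEL) -> (N_SPK,)."""
    return mel.mean(axis=0) @ W_SPEAKER


class ToySynthesizer:
    """Linear stand-in for G_synth: (content, speaker embedding) -> mel frames."""

    def __init__(self, rng):
        self.w_content = rng.normal(scale=0.1, size=(N_CONTENT, N_MEL))
        self.w_speaker = rng.normal(scale=0.1, size=(N_SPK, N_MEL))

    def __call__(self, content, speaker):
        # The speaker term is broadcast across all T frames.
        return content @ self.w_content + speaker @ self.w_speaker


def heuristic_transform(mel, rng):
    """Placeholder heuristic perturbation (additive noise is an assumption)."""
    return mel + rng.normal(scale=0.1, size=mel.shape)


def self_transform(mel, synth, other_speaker):
    """Voice-convert the utterance to a different speaker with the current model."""
    return synth(extract_content(mel), other_speaker)


def training_loss(synth, mel, other_speaker, rng, p_self=0.5):
    """Reconstruction loss for one utterance under a randomly chosen transform."""
    if rng.random() < p_self:
        transformed = self_transform(mel, synth, other_speaker)
    else:
        transformed = heuristic_transform(mel, rng)
    content = extract_content(transformed)  # content from the transformed input
    speaker = extract_speaker(mel)          # speaker identity from the original
    reconstruction = synth(content, speaker)
    return float(np.mean((reconstruction - mel) ** 2))


rng = np.random.default_rng(1)
synth = ToySynthesizer(rng)
x = rng.normal(size=(20, N_MEL))                            # "original" utterance
other_speaker = extract_speaker(rng.normal(size=(15, N_MEL)))
loss = training_loss(synth, x, other_speaker, rng)
```

In a real system the loss would be minimized with gradient updates to the synthesizer, and the self-synthesized input would presumably be treated as fixed data rather than differentiated through. As the synthesizer improves, the self-transformed inputs become harder to distinguish from genuine speech by another speaker, which is what makes the objective progressively more challenging.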

Paper Structure (18 sections, 3 equations, 6 figures, 5 tables, 1 algorithm)

Figures (6)

  • Figure 1: SelfVC Overview: The synthesizer $G_{\textit{synth}}$ is trained to reconstruct the mel-spectrogram from the SSL-based content representation of a transformed version of the audio and the speaker embedding of the original audio. The transformation is either a heuristic transform or voice conversion of the audio by self-synthesis with a different speaker embedding.
  • Figure 2: (a) Feature Extraction: The feature extractor derives the duration-augmented content information from an SSL model, the pitch contour using the pYIN algorithm, and the speaker embedding from a speaker verification model. (b) Mel Spectrogram Synthesizer: Reconstructs the mel-spectrogram from the derived features.
  • Figure 3: Left: SV-EER of voice-converted speech generated by SelfVC using different amounts of target speaker data for estimating the speaker embedding. Right: t-SNE visualization of speaker embeddings of SelfVC-synthesized and ground-truth audio for 10 target speakers. Each color represents a different speaker.
  • Figure 4: Phoneme Error Rate on individual languages of the CSS10 dataset for voice conversion experiments in which the source utterance is from CSS10 and the target speaker is from another language in CSS10 or from the LibriTTS test-clean dataset.
  • Figure 5: User study template used for the Naturalness MOS evaluation.
  • ...and 1 more figures
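
The SV-EER metric named in the Figure 3 caption is a speaker verification equal error rate: the operating point at which the false acceptance rate equals the false rejection rate. The sketch below shows one common way to estimate it from similarity scores. The function name, the threshold sweep over observed scores, and the example scores are assumptions for illustration; how the paper constructs its verification trials is not stated in this section.

```python
import numpy as np


def equal_error_rate(target_scores, impostor_scores):
    """Estimate the EER from same-speaker and different-speaker trial scores.

    A trial is accepted when its score is >= the threshold. The threshold is
    swept over every observed score, and the EER is taken where the false
    acceptance and false rejection rates are closest to each other.
    """
    target = np.asarray(target_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target, impostor]))
    far = np.array([(impostor >= t).mean() for t in thresholds])  # false accepts
    frr = np.array([(target < t).mean() for t in thresholds])     # false rejects
    best = int(np.argmin(np.abs(far - frr)))
    return float((far[best] + frr[best]) / 2.0)


# Hypothetical similarity scores between embeddings of utterance pairs.
same_speaker = [0.9, 0.8, 0.3, 0.7]
different_speaker = [0.1, 0.2, 0.6, 0.4]
eer = equal_error_rate(same_speaker, different_speaker)
```

For voice conversion, a lower SV-EER indicates that the verification model more often treats converted speech and genuine target-speaker speech as the same speaker.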