RAVE for Speech: Efficient Voice Conversion at High Sampling Rates

Anders R. Bargum, Simon Lajboschitz, Cumhur Erkut

TL;DR

This work targets real-time voice conversion (VC) at high sampling rates by extending the RAVE framework with speech-content representations; the resulting system is referred to as S-RAVE. It replaces the variational encoder with a FiLM-conditioned auto-encoder, guided by a HuBERT-based teacher and supported by information perturbation and a speaker embedding to achieve content–speaker disentanglement. A multi-discriminator setup with STFT-based losses enforces naturalness and intelligibility. Using a $64$-dimensional latent representation sampled at about $47$ Hz and a $16$-band PQMF, the system reaches competitive naturalness and intelligibility for seen speakers while running substantially faster at inference than diffusion-based baselines. Similarity to unseen target speakers remains a bottleneck, however, leaving room for improvement in zero-shot generalization. Overall, S-RAVE shows that end-to-end, time-domain VC at high sampling rates is feasible in DAWs and real-time audio design contexts, which makes live voice design and audio-effect applications practical.
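To make the FiLM conditioning mentioned above concrete, here is a minimal sketch of feature-wise linear modulation in plain Python: a speaker embedding is projected to one scale and one shift per channel, which are then applied to every frame of a feature map. The function names, the embedding size, the projection weights, and the choice to modulate the $64$-channel latent directly are illustrative assumptions, not details taken from the paper.

```python
import random

def linear(weights, vector):
    """Project `vector` with a (rows x len(vector)) weight matrix."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]

def film(features, gamma, beta):
    """Feature-wise linear modulation.

    features: list of channels, each a list of frame values
    gamma, beta: one scale and one shift per channel
    """
    return [[g * x + b for x in channel]
            for channel, g, b in zip(features, gamma, beta)]

random.seed(0)
channels, frames = 64, 47   # 64-dim latent, roughly one second at ~47 Hz
embed_dim = 8               # hypothetical speaker-embedding size

# Stand-ins for learned quantities (random here, trained in a real model).
latent = [[random.gauss(0, 1) for _ in range(frames)] for _ in range(channels)]
speaker = [random.gauss(0, 1) for _ in range(embed_dim)]
w_gamma = [[random.gauss(0, 0.1) for _ in range(embed_dim)] for _ in range(channels)]
w_beta = [[random.gauss(0, 0.1) for _ in range(embed_dim)] for _ in range(channels)]

gamma = linear(w_gamma, speaker)   # per-channel scale derived from the speaker
beta = linear(w_beta, speaker)     # per-channel shift derived from the speaker
conditioned = film(latent, gamma, beta)
```

Because the speaker information enters only through the per-channel scale and shift, swapping the embedding at inference changes the conditioning without touching the content features themselves.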

Abstract

Voice conversion has gained increasing popularity within the field of audio manipulation and speech synthesis. Often, the main objective is to transfer the input identity to that of a target speaker without changing its linguistic content. While current work provides high-fidelity solutions, it rarely focuses on model simplicity, high-sampling-rate environments, or streamability. By incorporating speech representation learning into a generative timbre transfer model, traditionally created for musical purposes, we investigate the realm of voice conversion generated directly in the time domain at high sampling rates. More specifically, we guide the latent space of a baseline model towards linguistically relevant representations and condition it on external speaker information. Through objective and subjective assessments, we demonstrate that the proposed solution can attain levels of naturalness, quality, and intelligibility comparable to those of a state-of-the-art solution for seen speakers, while significantly decreasing inference time. However, despite the presence of target speaker characteristics in the converted output, the actual similarity to unseen speakers remains a challenge.

Paper Structure (18 sections, 8 equations, 2 figures, 2 tables)

Figures (2)

  • Figure 1: The proposed pipeline, including our speech-related extensions to the original RAVE model. Dotted lines represent parts used exclusively during training, i.e., the losses and the input speaker embedding, while striped lines represent aspects added for inference, i.e., external target audio samples.
  • Figure 2: t-SNE visualization of the content and speaker embeddings of utterances from seen and unseen speakers. Each color represents a distinct speaker, represented either by the spoken content, (a) and (c), or by identity/timbre characteristics, (b) and (d).