SynthVC: Leveraging Synthetic Data for End-to-End Low Latency Streaming Voice Conversion

Zhao Guo, Ziqian Ning, Guobin Ma, Lei Xie

TL;DR

SynthVC tackles real-time streaming voice conversion by removing reliance on content-speaker disentanglement and external linguistic features. It uses a neural codec backbone (AudioDec) and a latent-space Converter to perform direct timbre mapping, trained on synthetic parallel data generated by a pre-trained zero-shot VC model, Seed-VC. A two-stage training regime with latent alignment and adversarial refinement yields naturalness and speaker similarity that surpass baseline streaming VC systems, at an end-to-end latency of 77.1 ms. The approach demonstrates that synthetic parallel data can effectively substitute for non-parallel data and ASR-based features in streaming VC, enabling efficient, high-fidelity waveform-to-waveform conversion for real-time use.

Abstract

Voice Conversion (VC) aims to modify a speaker's timbre while preserving linguistic content. While recent VC models achieve strong performance, most struggle in real-time streaming scenarios due to high latency, dependence on ASR modules, or complex speaker disentanglement, which often results in timbre leakage or degraded naturalness. We present SynthVC, a streaming end-to-end VC framework that directly learns speaker timbre transformation from synthetic parallel data generated by a pre-trained zero-shot VC model. This design eliminates the need for explicit content-speaker separation or recognition modules. Built upon a neural audio codec architecture, SynthVC supports low-latency streaming inference with high output fidelity. Experimental results show that SynthVC outperforms baseline streaming VC systems in both naturalness and speaker similarity, achieving an end-to-end latency of just 77.1 ms.

Paper Structure

This paper contains 23 sections, 8 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: The overall framework of SynthVC consists of a two-stage training strategy.
  • Figure 2: Waveform-level training uses synthetic waveforms as input and original waveforms as supervision. This approach can lead to over-smoothing and loss of audio detail.
  • Figure 3: Spectrogram comparison of converted audio across ablation variants and SynthVC. (a) and (b) show results after only Stage 1 training. Removing the Converter (a) leads to blurred high-frequency details, while SynthVC (b) preserves spectral fidelity. (c) and (d) correspond to Stage 2 training. Omitting alignment between training and inference (c) introduces spectral artifacts, whereas SynthVC (d) eliminates such artifacts and maintains high-frequency detail.