O_O-VC: Synthetic Data-Driven One-to-One Alignment for Any-to-Any Voice Conversion

Huu Tuong Tu, Huan Vu, Cuong Tien Nguyen, Dien Hy Ngo, Nguyen Thi Thu Trang

TL;DR

O_O-VC trains an any-to-any voice conversion model on pairs of synthetic utterances that share identical linguistic content but differ in speaker. A multispeaker TTS model (VITS) generates the paired data, so the model learns a direct input-output mapping instead of relying on heavy disentanglement and reconstruction. A two-phase training regime (synthetic-data pretraining followed by real-data fine-tuning) supports zero-shot generalization to unseen speakers and languages. The reported gains are a 16.35% relative reduction in word error rate and a 5.91% improvement in speaker cosine similarity; intelligibility is measured with WER/CER and speaker similarity with SECS. Semantic alignment analyses indicate frame-level correspondence between synthetic pairs, and ablations show that synthetic data, $F_0$ conditioning, and phase-2 adaptation each contribute to reducing speaker leakage. The approach looks promising for practical, language-agnostic VC, but it depends on a high-quality TTS model; alternative TTS backbones and ethical safeguards remain open directions.
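As a rough illustration of the two kinds of numbers quoted above, the sketch below computes a relative reduction of an error rate and a cosine similarity between two embedding vectors. It is a generic sketch, not the paper's evaluation code: the function names are hypothetical, and the inputs in the usage note are made-up values rather than results from the paper.

```python
import math
from typing import Sequence


def relative_reduction(baseline: float, proposed: float) -> float:
    """Fraction by which `proposed` lowers an error rate relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - proposed) / baseline


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (e.g. speaker embeddings)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)
```

For example, with illustrative inputs only, `relative_reduction(10.0, 8.0)` gives a 20% relative reduction, and `cosine_similarity` returns a value near 1 for embeddings pointing in the same direction and near 0 for orthogonal ones.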

Abstract

Traditional voice conversion (VC) methods typically attempt to separate speaker identity and linguistic information into distinct representations, which are then combined to reconstruct the audio. However, effectively disentangling these factors remains challenging, often leading to information loss during training. In this paper, we propose a new approach that leverages synthetic speech data generated by a high-quality, pretrained multispeaker text-to-speech (TTS) model. Specifically, synthetic data pairs that share the same linguistic content but differ in speaker identity are used as input-output pairs to train the voice conversion model. This enables the model to learn a direct mapping between source and target voices, effectively capturing speaker-specific characteristics while preserving linguistic content. Additionally, we introduce a flexible training strategy for any-to-any voice conversion that generalizes well to unseen speakers and new languages, enhancing adaptability and performance in zero-shot scenarios. Our experiments show that our proposed method achieves a 16.35% relative reduction in word error rate and a 5.91% improvement in speaker cosine similarity, outperforming several state-of-the-art methods. Voice conversion samples can be accessed at: https://oovc-emnlp-2025.github.io/
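The pairing idea described in the abstract can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' pipeline: `make_parallel_pairs` and `toy_synthesize` are hypothetical names, and the toy synthesizer is a stand-in for a real multispeaker TTS model that simply returns one number per character.

```python
import random
from typing import Callable, Dict, Iterator, List, Sequence

Waveform = List[float]


def make_parallel_pairs(
    texts: Sequence[str],
    speakers: Sequence[str],
    synthesize: Callable[[str, str], Waveform],
    seed: int = 0,
) -> Iterator[Dict[str, object]]:
    """Yield training pairs: same text, two different speakers."""
    unique_speakers = sorted(set(speakers))
    if len(unique_speakers) < 2:
        raise ValueError("need at least two distinct speakers")
    rng = random.Random(seed)
    for text in texts:
        # Pick two distinct speakers for this sentence.
        source_speaker, target_speaker = rng.sample(unique_speakers, 2)
        yield {
            "text": text,
            "source_speaker": source_speaker,
            "target_speaker": target_speaker,
            # Model input: the sentence spoken by the source speaker.
            "source": synthesize(text, source_speaker),
            # Training target: the same sentence spoken by the target speaker.
            "target": synthesize(text, target_speaker),
        }


def toy_synthesize(text: str, speaker: str) -> Waveform:
    """Hypothetical stand-in for a TTS call: one value per character, shifted per speaker."""
    offset = sum(ord(c) for c in speaker) % 7
    return [float(ord(c) + offset) for c in text]
```

Calling `list(make_parallel_pairs(["hello", "world"], ["a", "b", "c"], toy_synthesize))` yields one pair per sentence, each with the same text and two different speakers. In a real setup the `synthesize` argument would wrap a pretrained multispeaker TTS model.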

Paper Structure

This paper contains 26 sections, 12 equations, 6 figures, and 4 tables.

Figures (6)

  • Figure 1: Voice conversion with synthetic data.
  • Figure 2: t-SNE visualization of speaker-independent features. More evenly distributed points with no clusters indicate better speaker independence.
  • Figure 3: Comparison of systems on F0-PCC.
  • Figure 4: Performance of new language adaptation: CER for Chinese, WER for Vietnamese and Italian.
  • Figure 5: Semantic alignment of source and target audio via synthetic data.
  • ...and 1 more figure