The Voice Conversion Challenge 2018: Promoting Development of Parallel and Nonparallel Methods

Jaime Lorenzo-Trueba, Junichi Yamagishi, Tomoki Toda, Daisuke Saito, Fernando Villavicencio, Tomi Kinnunen, Zhenhua Ling

TL;DR

The Voice Conversion Challenge 2018 advances the evaluation of speaker-identity voice conversion by introducing parallel (Hub) and non-parallel (Spoke) tasks on a common DAPS-based dataset, coupled with large-scale crowdsourced perceptual tests of naturalness and similarity. The study demonstrates substantial progress with neural approaches (e.g., N10) and neural vocoders (WaveNet), while also revealing persistent challenges in cross-gender and non-parallel settings through perceptual drops and WER correlations. A baseline suite and a detailed description of the participants enable fair comparison and reproducibility, and the data and results are publicly accessible for ongoing research. Spoofing analyses are acknowledged as a separate line of inquiry into the security implications.

Abstract

We present the Voice Conversion Challenge 2018, designed as a follow-up to the 2016 edition with the aim of providing a common framework for evaluating and comparing different state-of-the-art voice conversion (VC) systems. The objective of the challenge was to perform speaker conversion (i.e., to transform the vocal identity) of a source speaker to a target speaker while maintaining linguistic information. As an update to the previous challenge, we considered both parallel and non-parallel data to form the Hub and Spoke tasks, respectively. A total of 23 teams from around the world submitted their systems; 11 of them additionally participated in the optional Spoke task. A large-scale crowdsourced perceptual evaluation was then carried out to rate the submitted converted speech in terms of naturalness and similarity to the target speaker identity. In this paper, we present a brief summary of the state-of-the-art techniques for VC, followed by a detailed explanation of the challenge tasks and the results that were obtained.

Paper Structure

This paper contains 23 sections, 15 figures, 3 tables.

Figures (15)

  • Figure 1: Naturalness results of the Hub task for all speaker pairs. MOS scores are averaged across all pairs, arranged in accordance with their mean (red dot).
  • Figure 2: Naturalness results of the Hub task for same-gender conversion pairs. MOS scores are averaged across all pairs, arranged in accordance with their mean (red dot).
  • Figure 3: Naturalness results of the Hub task for cross-gender speaker pairs. MOS scores are averaged across all pairs, arranged in accordance with their mean (red dot).
  • Figure 4: Similarity results of the target speaker for the Hub task averaged across all speaker pairs.
  • Figure 5: Scatter plot matching naturalness and similarity scores to target speaker for the Hub task when averaging all speaker pairs.
  • ...and 10 more figures
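
The naturalness figures listed above report mean opinion scores (MOS) averaged across speaker pairs, with systems arranged by their mean. The following is a minimal sketch of that kind of aggregation, not the organizers' evaluation code. All ratings, the pair labels, and the system IDs `X01` and `X02` are made up for illustration, and a 1-5 rating scale is assumed. The sketch also pools individual ratings across pairs, which matches a per-pair average only when every pair has the same number of ratings.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical listener ratings: (system_id, speaker_pair, score).
# The scores and pair labels are illustrative only; a 1-5 scale is assumed.
ratings = [
    ("N10", "pair_A", 5),
    ("N10", "pair_B", 4),
    ("X01", "pair_A", 3),
    ("X01", "pair_B", 4),
    ("X02", "pair_A", 4),
    ("X02", "pair_B", 2),
]


def mean_opinion_scores(ratings):
    """Pool all ratings for each system and return its mean score."""
    by_system = defaultdict(list)
    for system, _pair, score in ratings:
        by_system[system].append(score)
    return {system: mean(scores) for system, scores in by_system.items()}


def rank_by_mean(mos):
    """Return (system, mean) tuples ordered from lowest to highest mean."""
    return sorted(mos.items(), key=lambda item: item[1])


mos = mean_opinion_scores(ratings)
ranking = rank_by_mean(mos)
for system, score in ranking:
    print(f"{system}: {score:.2f}")
```

Restricting `ratings` to same-gender or cross-gender pairs before calling `mean_opinion_scores` would give the corresponding subset views.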