The Zero Resource Speech Challenge 2019: TTS without T

Ewan Dunbar, Robin Algayres, Julien Karadayi, Mathieu Bernard, Juan Benjumea, Xuan-Nga Cao, Lucie Miskic, Charlotte Dugrain, Lucas Ondel, Alan W. Black, Laurent Besacier, Sakriani Sakti, Emmanuel Dupoux

TL;DR

The paper presents the Zero Resource Speech Challenge 2019, which tasks participants with building a text-to-speech system without any text or phonetic labels, by discovering subword units in an unsupervised way and aligning them to a target voice. It defines the datasets, the evaluation metrics (character error rate (CER), mean opinion score (MOS), ABX discriminability, and bitrate), a baseline, and a topline, and reports results from 19 submissions across 10 teams. The results show that TTS without text is feasible but remains difficult when learned representations are very text-like. Key findings include a monotonic relationship between ABX discriminability and synthesis quality, and that higher-bitrate embeddings can improve performance while discretization remains a challenge. The work highlights the potential and limits of purely unsupervised units for low-resource TTS and suggests directions for improving low-bitrate representations and decoding strategies.
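As an illustration of one of the metrics named above, the following is a minimal sketch of a character error rate (CER) computation, using the usual definition of Levenshtein edit distance divided by reference length. The function names `edit_distance` and `character_error_rate` are hypothetical, and the challenge's actual transcription and scoring protocol is not described in this summary, so this should be read as a generic example rather than the official evaluation code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert/delete/substitute cost 1)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """CER = edit distance / number of reference characters (assumed definition)."""
    if len(reference) == 0:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

A lower CER means the transcribed synthesis is closer to the reference text. The value can exceed 1.0 when the hypothesis is much longer than the reference.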

Abstract

We present the Zero Resource Speech Challenge 2019, which proposes to build a speech synthesizer without any text or phonetic labels: hence, TTS without T (text-to-speech without text). We provide raw audio for a target voice in an unknown language (the Voice dataset), but no alignment, text or labels. Participants must discover subword units in an unsupervised way (using the Unit Discovery dataset) and align them to the voice recordings in a way that works best for the purpose of synthesizing novel utterances from novel speakers, similar to the target speaker's voice. We describe the metrics used for evaluation, a baseline system consisting of unsupervised subword unit discovery plus a standard TTS system, and a topline TTS using gold phoneme transcriptions. We present an overview of the 19 submitted systems from 10 teams and discuss the main results.

Paper Structure

This paper contains 10 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Schematic diagram of the challenge.
  • Figure 2: Embedding quality as a function of the log bitrate for the development language (left) and the surprise language (right). Light grey boxes represent auxiliary embeddings; dark grey boxes are our reference scores. Lower left is better.
  • Figure 3: CER as a function of the log bitrate for the development language (left) and the surprise language (right). Dark grey boxes are our reference scores. Lower left is better. Dotted lines are CER on original recordings.
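Figures 2 and 3 plot scores against the log bitrate of the learned embeddings. This summary does not give the challenge's bitrate formula, so the sketch below assumes one plausible estimate: the number of emitted symbols times the empirical entropy per symbol, divided by the total audio duration. The function name `estimate_bitrate` and its parameters are hypothetical.

```python
import math
from collections import Counter


def estimate_bitrate(symbols, total_duration_s):
    """Estimate bits per second of a symbol sequence (assumed formula).

    symbols: sequence of hashable units (e.g. cluster ids, or tuples for vectors).
    total_duration_s: total duration of the encoded audio, in seconds.
    """
    if total_duration_s <= 0:
        raise ValueError("total_duration_s must be positive")
    n = len(symbols)
    if n == 0:
        return 0.0
    counts = Counter(symbols)
    # Empirical Shannon entropy of the symbol distribution, in bits per symbol.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return n * entropy / total_duration_s
```

Under this estimate, a system can lower its bitrate either by emitting fewer symbols per second or by using a smaller or more skewed symbol inventory, which is why very text-like representations sit at the low end of the bitrate axis.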