Building speech corpus with diverse voice characteristics for its prompt-based representation

Aya Watanabe; Shinnosuke Takamichi; Yuki Saito; Wataru Nakata; Detai Xin; Hiroshi Saruwatari

TL;DR

This work addresses the limited diversity of voice characteristics in prompt-based TTS by constructing Coco-Nut, a large open corpus that pairs in-the-wild Japanese speech collected from the Internet with free-form voice-characteristics descriptions. It introduces a three-stage pipeline (data collection, quality assurance, manual annotation) and a retrieval model that combines contrastive learning with a feature-prediction objective to better align speech with descriptive prompts. Using the feature-prediction loss improves subjective retrieval quality and raises zero-shot classification accuracy, which supports incorporating perceptual voice features into the embeddings. The Coco-Nut corpus and the proposed training approach provide a scalable foundation for controllable, prompt-based TTS.

Abstract

In text-to-speech synthesis, the ability to control voice characteristics is vital for various applications. Leveraging rapidly advancing text prompt-based generation techniques should enable more nuanced control of voice characteristics. While previous research has explored prompt-based manipulation of voice characteristics, most studies have used pre-recorded speech, which limits the diversity of voice characteristics available. We address this gap by creating a novel corpus and developing a model for prompt-based manipulation of voice characteristics in text-to-speech synthesis, covering a broader range of voice characteristics. Specifically, we propose a method for building a sizable corpus that pairs voice-characteristics descriptions with corresponding speech samples. The method automatically gathers voice-related speech data from the Internet, ensures its quality, and manually annotates it through crowdsourcing. We apply this method to Japanese-language data and analyze the results to validate its effectiveness. We then propose a method for constructing a model that retrieves speech from voice-characteristics descriptions based on contrastive learning. We train the model using not only conventional contrastive learning but also feature prediction learning, which predicts quantitative speech features corresponding to voice characteristics. We evaluate the model's performance in experiments on the constructed corpus.
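
As a rough illustration of the training objective described in the abstract, the following sketch combines a CLAP-style symmetric contrastive loss with a feature-prediction term. It is not the authors' implementation: the function names, the temperature value, the mean-squared-error form of the feature-prediction loss, and the `weight` parameter are all assumptions made for this example.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to (almost) unit length so dot products act as cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_loss(speech_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over the speech-description similarity matrix.

    Row i of both inputs is assumed to belong to the same (speech, description) pair.
    The temperature value is an assumption, not a setting taken from the paper.
    """
    s = l2_normalize(speech_emb)
    t = l2_normalize(text_emb)
    logits = s @ t.T / temperature
    idx = np.arange(len(s))
    # Speech-to-description direction: normalize over descriptions (columns).
    speech_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Description-to-speech direction: normalize over speech samples (rows).
    text_to_speech = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (speech_to_text + text_to_speech)


def feature_prediction_loss(predicted, target):
    """Mean squared error between predicted and reference speech features (assumed form)."""
    return float(np.mean((predicted - target) ** 2))


def total_loss(speech_emb, text_emb, predicted, target, weight=1.0):
    """Contrastive term plus a weighted feature-prediction term (weight is hypothetical)."""
    return contrastive_loss(speech_emb, text_emb) + weight * feature_prediction_loss(predicted, target)
```

In this sketch, `predicted` would be the output of a small regression head on top of the embedding, and `target` a vector of quantitative speech features; which features are predicted, and from which embedding, is left unspecified here.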

Paper Structure (31 sections, 4 equations, 15 figures, 5 tables)

Introduction
Related work
Sequence generation from text
Dataset for text-to-image
Dataset for text-to-audio and text-to-music
Contrastive learning for text-audio
Corpus construction
Data collection
Video filtering
Quality assurance
Audio quality
Content quality
Manual annotation
Corpus construction settings and results
Data collection
...and 16 more sections

Figures (15)

Figure 1: Our Coco-Nut corpus towards prompt-based TTS. Voice characteristics description and reading text are, for example, "middle-aged man's voice speaking in a clear and polite tone" and "Welcome to our office!", respectively. A speech synthesizer synthesizes the speech of the prompted content on the basis of the prompted voice characteristics.
Figure 2: CLAP overview. Embedding vectors from text and audio are learned by contrastive learning.
Figure 3: Procedure of corpus construction.
Figure 4: Histogram of NISQA-predicted MOS on speech quality.
Figure 5: Histogram of MLM scores.
...and 10 more figures
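
The quality-assurance stage, for which Figures 4 and 5 show score distributions, can be pictured as a filter over candidate clips. The sketch below is only an assumption about how such a filter might look: the `Candidate` fields, the function names, the thresholds, and the convention that higher scores are better are hypothetical, and no threshold values from the paper are implied.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    clip_id: str
    predicted_mos: float  # speech-quality prediction, e.g. from a NISQA-style model
    mlm_score: float      # masked-language-model score of the clip's text


def passes_quality_checks(candidate, min_mos, min_mlm_score):
    """Keep a clip only if both its audio and content scores reach the thresholds.

    Treating higher scores as better for both measures is an assumption of this sketch.
    """
    return candidate.predicted_mos >= min_mos and candidate.mlm_score >= min_mlm_score


def filter_candidates(candidates, min_mos, min_mlm_score):
    """Return the candidates that pass both checks, preserving input order."""
    return [c for c in candidates if passes_quality_checks(c, min_mos, min_mlm_score)]
```

The thresholds are deliberately left as required arguments so that the caller has to choose them, for example by inspecting histograms like those in Figures 4 and 5.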
