Building speech corpus with diverse voice characteristics for its prompt-based representation
Aya Watanabe, Shinnosuke Takamichi, Yuki Saito, Wataru Nakata, Detai Xin, Hiroshi Saruwatari
TL;DR
This work addresses the lack of diverse voice characteristics in prompt-based TTS by constructing Coco-Nut, a large open corpus of Japanese speech paired with free-form voice-characteristics descriptions, sourced from in-the-wild Internet data. It introduces a three-stage pipeline (data collection, quality assurance, manual annotation) and a retrieval model that combines contrastive learning with a feature-prediction objective to better align speech with descriptive prompts. Empirical results show improved subjective retrieval quality and higher zero-shot classification accuracy when the feature-prediction loss is used, supporting the value of incorporating perceptual voice features into embeddings. The Coco-Nut corpus and the proposed training approach provide a scalable foundation for controllable, prompt-based TTS and have practical implications for building richly parameterized speech synthesis systems.
Abstract
In text-to-speech synthesis, the ability to control voice characteristics is vital for various applications. Recent advances in text prompt-based generation should make more nuanced control of voice characteristics possible. Previous research has explored prompt-based manipulation of voice characteristics, but most studies have used pre-recorded speech, which limits the diversity of voice characteristics available. We address this gap by creating a novel corpus and developing a model for prompt-based manipulation of voice characteristics in text-to-speech synthesis, covering a broader range of voice characteristics. Specifically, we propose a method for building a sizable corpus that pairs voice characteristics descriptions with corresponding speech samples. The method automatically gathers voice-related speech data from the Internet, ensures its quality, and manually annotates it through crowdsourcing. We apply this method to Japanese-language data and analyze the results to validate its effectiveness. We then propose a method for constructing a model that retrieves speech from voice characteristics descriptions based on contrastive learning. We train the model using not only conventional contrastive learning but also feature prediction learning, which predicts quantitative speech features corresponding to voice characteristics. We evaluate the model's performance in experiments with the constructed corpus.
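
The training objective described in the abstract, contrastive learning combined with feature prediction, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the symmetric InfoNCE form, the mean-squared-error feature loss, the `temperature` value, and the `weight` coefficient are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import numpy as np


def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def l2_normalize(x):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def contrastive_loss(speech_emb, text_emb, temperature=0.07):
    # Symmetric InfoNCE (assumed form): row i of both batches is a matched
    # speech/description pair; all other rows in the batch act as negatives.
    s = l2_normalize(speech_emb)
    t = l2_normalize(text_emb)
    logits = s @ t.T / temperature
    idx = np.arange(len(s))
    speech_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    text_to_speech = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (speech_to_text + text_to_speech)


def feature_prediction_loss(predicted, target):
    # Mean squared error between predicted and reference speech features
    # (assumed loss; the abstract only says features are predicted).
    return float(np.mean((predicted - target) ** 2))


def total_loss(speech_emb, text_emb, predicted, target, weight=1.0):
    # `weight` is a hypothetical balancing coefficient, not from the paper.
    contrastive = contrastive_loss(speech_emb, text_emb)
    return contrastive + weight * feature_prediction_loss(predicted, target)


def retrieve(query_text_emb, speech_embs, top_k=1):
    # Rank speech embeddings by cosine similarity to one description embedding.
    sims = l2_normalize(speech_embs) @ l2_normalize(query_text_emb[None, :])[0]
    return np.argsort(-sims)[:top_k]
```

In this sketch, matched pairs are pulled together in the shared embedding space while the feature-prediction term encourages the embedding to carry quantitative voice information; `retrieve` then returns the indices of the speech samples closest to a given description embedding.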
