TTSOps: A Closed-Loop Corpus Optimization Framework for Training Multi-Speaker TTS Models from Dark Data

Kentaro Seki, Shinnosuke Takamichi, Takaaki Saeki, Hiroshi Saruwatari

TL;DR

TTSOps advances corpus construction for multi-speaker TTS by treating training data quality as the primary criterion, and by unifying utterance-level switching of data cleansing methods with evaluation-in-the-loop data selection in a closed loop. Applied to dark, web-scale data (e.g., YouTube), it jointly optimizes which utterances to include and how to cleanse them so as to maximize downstream TTS performance. Empirical results show improved pseudo MOS and actual MOS correlations, higher counts of high-quality speakers, and greater speaker diversity, with demonstrated cross-lingual generalization and manageable compute costs. The framework is language-agnostic, scalable, and adaptable to future evaluation metrics, offering a principled path from noisy web data to high-quality, diverse TTS corpora.

Abstract

This paper presents TTSOps, a fully automated closed-loop framework for constructing multi-speaker text-to-speech (TTS) systems from noisy, uncurated web-scale speech data, often referred to as "dark data," such as online videos. Conventional TTS training pipelines require well-curated corpora with high acoustic quality and accurate text-speech alignment, which severely limits scalability, speaker diversity, and real-world applicability. While recent studies have proposed acoustic-quality-based data selection techniques, they often overlook two critical aspects: (1) the inherent robustness of modern TTS models to noise, and (2) the potential contribution of perceptually low-quality yet informative samples. To address these issues, TTSOps introduces a data-centric training pipeline that integrates three core components: (1) automated data collection from dark data sources, (2) utterance-level dynamic selection of data cleansing methods based on training data quality, and (3) evaluation-in-the-loop data selection using automatically predicted mean opinion scores (MOS) to estimate each utterance's impact on model performance. Furthermore, TTSOps jointly optimizes the corpus and the TTS model in a closed-loop framework by dynamically adapting both data selection and data cleansing processes to the characteristics of the target TTS model. Extensive experiments on Japanese YouTube data demonstrate that TTSOps outperforms conventional acoustic-quality-based baselines in both the naturalness and speaker diversity of synthesized speech.
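
Components (2) and (3) of the abstract can be illustrated with a minimal sketch. It assumes that a training-data-quality score has already been estimated for every utterance under every candidate cleansing method (in TTSOps such scores would come from the train-and-evaluate loop, which is not reproduced here). The function name `build_corpus`, the method labels `"none"` and `"denoise"`, the scores, and the threshold are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple


def build_corpus(
    quality: Dict[str, Dict[str, float]],
    threshold: float,
) -> List[Tuple[str, str, float]]:
    """Choose a cleansing method per utterance, then filter utterances.

    quality[utt_id][method] is a hypothetical training-data-quality score
    for utterance `utt_id` after applying cleansing `method`. The scores
    are assumed to be given; estimating them is outside this sketch.
    """
    corpus = []
    for utt_id, scores in quality.items():
        # Utterance-level switching: keep the method with the best score.
        method, score = max(scores.items(), key=lambda kv: kv[1])
        # Utterance-wise filtering: drop utterances that stay below threshold.
        if score >= threshold:
            corpus.append((utt_id, method, score))
    return corpus


# Hypothetical scores for three utterances and two cleansing options.
quality = {
    "utt_a": {"none": 3.1, "denoise": 3.6},
    "utt_b": {"none": 2.0, "denoise": 2.4},
    "utt_c": {"none": 4.0, "denoise": 3.2},
}
corpus = build_corpus(quality, threshold=3.0)
print(corpus)
```

In this toy input, `utt_a` benefits from cleansing, `utt_c` is better left untouched, and `utt_b` is excluded under either option, which is the kind of per-utterance decision the abstract describes.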

Paper Structure

This paper contains 49 sections, 17 figures, 3 tables.

Figures (17)

  • Figure 1: Procedure of the proposed data selection method based on training data quality. We evaluate the training data quality of each utterance through a loop of TTS model training and synthetic-speech evaluation. The method finally builds a TTS corpus from dark data by utterance-wise filtering.
  • Figure 2: Procedure of TTSOps. We obtain dark data from YouTube and evaluate each utterance through a loop of TTS model training and synthetic-speech evaluation. The framework finally builds a TTS corpus from dark data by utterance-wise filtering.
  • Figure 3: Comparison of speaker-wise and utterance-wise selection. With regression, we filter out low-score utterances even if the speaker's pseudo MOS is high.
  • Figure 4: Comparison between conventional data cleansing procedures and the proposed data cleansing procedure.
  • Figure 5: A method for evaluating the quality of training data for each data cleansing method. The training and evaluation loop is applied to datasets obtained by uniformly applying each data cleansing technique, and the training data quality is assessed.
  • ...and 12 more figures
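
The contrast drawn in the Figure 3 caption can be sketched as follows. This is an illustrative example only: it assumes a pseudo MOS per speaker and a score per utterance are already available (the regression that would produce utterance-level scores is not shown), and all identifiers, scores, and the threshold are hypothetical.

```python
from typing import Dict, List, Tuple

UttKey = Tuple[str, str]  # (speaker_id, utterance_id)


def speaker_wise(
    speaker_mos: Dict[str, float],
    utt_scores: Dict[UttKey, float],
    threshold: float,
) -> List[UttKey]:
    """Keep every utterance of a speaker whose pseudo MOS passes."""
    return [key for key in utt_scores if speaker_mos[key[0]] >= threshold]


def utterance_wise(
    utt_scores: Dict[UttKey, float],
    threshold: float,
) -> List[UttKey]:
    """Keep only utterances whose own score passes."""
    return [key for key, score in utt_scores.items() if score >= threshold]


# Hypothetical scores: spk1 passes as a speaker but has one weak utterance.
speaker_mos = {"spk1": 3.6, "spk2": 2.9}
utt_scores = {
    ("spk1", "u1"): 4.2,
    ("spk1", "u2"): 2.1,
    ("spk2", "u1"): 2.0,
    ("spk2", "u2"): 3.8,
}
by_speaker = speaker_wise(speaker_mos, utt_scores, threshold=3.0)
by_utterance = utterance_wise(utt_scores, threshold=3.0)
print(by_speaker)
print(by_utterance)
```

Here speaker-wise selection keeps the weak utterance `("spk1", "u2")` because its speaker passes, while utterance-wise selection drops it and also retains the one strong utterance from `spk2`.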