HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling

Chunhui Wang, Chang Zeng, Bowen Zhang, Ziyang Ma, Yefan Zhu, Zifeng Cai, Jian Zhao, Zhonglin Jiang, Yong Chen

TL;DR

This work tackles pronunciation accuracy, speaking style consistency, and timbre stability in token-based TTS under zero-shot conditions by introducing HAM-TTS, a hierarchical acoustic modeling framework. It combines a Text-to-LVS predictor with a Text-HuBERT aligner, refines HuBERT features with K-Means, and employs timbre-consistency data augmentation alongside a large synthetic dataset generated by a UNet-based few-shot voice converter to enable extensive real+synthetic training. The approach yields improved pronunciation and timbre fidelity over VALL-E on unseen AISHELL1 data, with HAM-TTS-L approaching ground-truth quality and robust zero-shot performance. These results suggest that structured latent acoustic representations, coupled with targeted data strategies, can substantially advance zero-shot TTS in data-rich and data-scarce regimes alike.

Abstract

Token-based text-to-speech (TTS) models have emerged as a promising avenue for generating natural and realistic speech, yet they grapple with low pronunciation accuracy, inconsistent speaking style and timbre, and a substantial need for diverse training data. In response, we introduce a novel hierarchical acoustic modeling approach complemented by a tailored data augmentation strategy and train it on a combination of real and synthetic data, scaling the data size up to 650k hours and yielding a zero-shot TTS model with 0.8B parameters. Specifically, our method uses a predictor to incorporate into the TTS model a latent variable sequence that carries supplementary acoustic information based on refined self-supervised learning (SSL) discrete units. This significantly mitigates pronunciation errors and style mutations in synthesized speech. During training, we strategically replace and duplicate segments of the data to enhance timbre uniformity. Moreover, a pretrained few-shot voice conversion model is utilized to generate a plethora of voices with identical content yet varied timbres. This facilitates the explicit learning of utterance-level one-to-many mappings, enriching speech diversity while also ensuring consistency in timbre. Comparative experiments (Demo page: https://anonymous.4open.science/w/ham-tts/) demonstrate our model's superiority over VALL-E in pronunciation precision, speaking style consistency, and timbre continuity.
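The abstract says only that segments of the data are replaced and duplicated during training to improve timbre uniformity. As a rough illustration of those two operations, the sketch below applies them to a generic frame sequence. Which sequence is edited (audio prompt, codec tokens, or something else), where the replacement segment comes from, and how positions and lengths are chosen are all assumptions here; the helper names `replace_segment` and `duplicate_segment` are hypothetical and not taken from the paper.

```python
import numpy as np


def replace_segment(frames, donor, start, length):
    """Overwrite frames[start:start+length] with the first `length` donor frames.

    Assumption: `donor` is another sequence of the same kind (for example,
    frames from a different utterance). The source does not specify this.
    """
    if start < 0 or start + length > len(frames) or length > len(donor):
        raise ValueError("segment does not fit inside frames/donor")
    out = frames.copy()
    out[start:start + length] = donor[:length]
    return out


def duplicate_segment(frames, start, length):
    """Repeat frames[start:start+length] immediately after itself."""
    if start < 0 or start + length > len(frames):
        raise ValueError("segment does not fit inside frames")
    end = start + length
    return np.concatenate([frames[:end], frames[start:end], frames[end:]], axis=0)


# Toy integer sequences stand in for real acoustic frames or tokens.
original = np.arange(10)
donor = -np.arange(1, 11)

replaced = replace_segment(original, donor, start=2, length=3)
duplicated = duplicate_segment(original, start=2, length=3)
```

Replacement keeps the sequence length unchanged, while duplication lengthens it by the segment length, so any aligned targets would need the same treatment in a real pipeline.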

Paper Structure (19 sections, 9 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Overview of HAM-TTS. Although it builds upon VALL-E, its design, including the Text-HuBERT aligner and the Text-to-LVS predictor, is applicable across various token-based TTS models. To enhance the ability of HAM-TTS to process semantic information, the codec language models also predict the phoneme sequence from the input text during training.
  • Figure 2: Structure of the Text-to-LVS predictor. "DP" denotes the dropout operation. The predictor learns the mapping from the text prompt to the LVS during training. Once training is complete, it generates the LVS directly from the text prompt at inference.
  • Figure 3: Structure of the Text-HuBERT aligner. It takes the text prompt and the refined HuBERT feature as input and generates the LVS during training. The generated LVS also serves as the supervision signal for training the Text-to-LVS predictor.
  • Figure 4: Structure of the UNet-based voice conversion model. It is used to generate extensive speech data with the same content but different timbres, using only several minutes of real speech from unseen target speakers.
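The captions above refer to a "refined HuBERT feature", which the TL;DR attributes to K-Means. One plausible reading, assumed here rather than confirmed by this summary, is that each frame-level feature is mapped to its nearest cluster centroid, giving a discrete unit index plus a centroid vector with less per-speaker detail. The sketch below shows that reading with a small NumPy-only K-Means. The random toy features, the cluster count, the feature dimension, and the function names `kmeans_fit` and `refine` are illustrative assumptions, not the paper's configuration.

```python
import numpy as np


def pairwise_sq_dist(features, centroids):
    """Squared Euclidean distance between every frame and every centroid."""
    diff = features[:, None, :] - centroids[None, :, :]
    return (diff ** 2).sum(axis=-1)


def kmeans_fit(features, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns a (k, dim) array of centroids."""
    rng = np.random.default_rng(seed)
    init = rng.choice(len(features), size=k, replace=False)
    centroids = features[init].copy()
    for _ in range(iters):
        labels = pairwise_sq_dist(features, centroids).argmin(axis=1)
        for j in range(k):
            members = features[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids


def refine(features, centroids):
    """Map each frame to its nearest centroid (assumed form of 'refinement')."""
    units = pairwise_sq_dist(features, centroids).argmin(axis=1)
    return units, centroids[units]


# Toy data: random vectors stand in for real HuBERT frame features.
rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 16)).astype(np.float32)

centroids = kmeans_fit(frames, k=8)
units, refined = refine(frames, centroids)
```

Here `units` is the discrete unit sequence and `refined` is the continuous feature sequence after quantization; under this assumption, the latter is what an aligner like the one in Figure 3 would consume.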