HiStyle: Hierarchical Style Embedding Predictor for Text-Prompt-Guided Controllable Speech Synthesis

Ziyu Zhang, Hanzhao Li, Jingbin Hu, Wenhao Li, Lei Xie

TL;DR

This work analyzes the hierarchical structure of text-prompt-driven style embeddings in TTS and uses this insight to design HiStyle, a two-stage hierarchical predictor that first predicts a global speaker embedding and then a fine-grained style embedding, both conditioned on a text prompt. The model uses diffusion-based transformer blocks and is trained with a combination of an MSE loss ($L_{\mathrm{MSE}}$) and a contrastive loss that aligns the text and audio embedding spaces, while a data-annotation pipeline combines statistical thresholds with human perceptual feedback to produce perceptually coherent style labels. Experiments on a 2000-hour expressive dataset show that HiStyle improves controllability across multiple style attributes while maintaining high naturalness and intelligibility, outperforming several baselines and ablations. The framework offers a generalizable approach to text-prompt-guided controllable TTS, with potential applicability to a wide range of voice styles and languages.

Abstract

Controllable speech synthesis refers to the precise control of speaking style by manipulating specific prosodic and paralinguistic attributes, such as gender, volume, speech rate, pitch, and pitch fluctuation. With the integration of advanced generative models, particularly large language models (LLMs) and diffusion models, controllable text-to-speech (TTS) systems have increasingly transitioned from label-based control to natural language description-based control, which is typically implemented by predicting global style embeddings from textual prompts. However, this straightforward prediction overlooks the underlying distribution of the style embeddings, which may keep controllable TTS systems from reaching their full potential. In this study, we use t-SNE to visualize and analyze the global style embedding distribution of various mainstream TTS systems, revealing a clear hierarchical clustering pattern: embeddings first cluster by timbre and then subdivide into finer clusters based on style attributes. Based on this observation, we propose HiStyle, a two-stage style embedding predictor that hierarchically predicts style embeddings conditioned on textual prompts, and further incorporate contrastive learning to help align the text and audio embedding spaces. Additionally, we propose a style annotation strategy that leverages the complementary strengths of statistical methodologies and human auditory preferences to generate more accurate and perceptually consistent textual prompts for style control. Comprehensive experiments demonstrate that, when applied to the base TTS model, HiStyle achieves significantly better style controllability than alternative style embedding prediction approaches while preserving high speech quality in terms of naturalness and intelligibility. Audio samples are available at https://anonymous.4open.science/w/HiStyle-2517/.
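
The summary above describes training with an MSE term plus a contrastive loss that aligns text and audio embeddings. The following is a minimal sketch of how such a combined objective could look, not the paper's actual formulation. The InfoNCE form, the one-directional (text-to-audio) matching, the `temperature` value, the `weight` factor, and all function names are assumptions made for illustration.

```python
import math

def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def info_nce(text_embs, audio_embs, temperature=0.07):
    """InfoNCE over a batch: the i-th text should match the i-th audio.

    Assumption: text-to-audio direction only; the paper may use another form.
    """
    total = 0.0
    for i, text in enumerate(text_embs):
        logits = [cosine(text, audio) / temperature for audio in audio_embs]
        # Numerically stable log-sum-exp over all candidates in the batch.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denom - logits[i]
    return total / len(text_embs)

def combined_loss(pred_embs, target_embs, text_embs, weight=0.1):
    """L_MSE plus a weighted contrastive term (weight is hypothetical)."""
    l_mse = sum(mse(p, t) for p, t in zip(pred_embs, target_embs)) / len(pred_embs)
    l_con = info_nce(text_embs, target_embs)
    return l_mse + weight * l_con
```

With this form, a batch in which each text embedding points the same way as its paired audio embedding yields a contrastive term near zero, while mismatched pairs are penalized in proportion to how strongly the text resembles the wrong audio.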

Paper Structure

This paper contains 13 sections, 6 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: t-SNE visualization of global embeddings from a Transformer-based encoder: (a) Speaker-Level Clustering, (b) Pitch Fluctuation-Level Sub-Clustering.
  • Figure 2: Overview of the HiStyle embedding predictor architecture. The upper-left subplot illustrates the first prediction stage: the Speaker Embedding Predictor takes the text prompt embedding as a condition and uses the reference speaker embedding to produce the predicted speaker embedding. The lower-left subplot shows the second stage: the Style Embedding Predictor takes as conditions the text prompt embedding and, through a residual connection, the intermediate result of the first stage (the predicted speaker embedding), and uses the fusion embedding to produce the predicted style embedding. The subplot on the right depicts the detailed architecture and training process of the two Embedding Predictors.
  • Figure 3: Iterative Annotation Pipeline Combining Statistical Thresholding and Human Perception
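
The two-stage flow described in the Figure 2 caption can be sketched as follows. This only illustrates the wiring (stage 1 output fed as a condition into stage 2) and is not the authors' model: both predictor functions are hypothetical toy stand-ins for the learned diffusion-based transformer predictors, and the reference speaker embedding and fusion embedding mentioned in the caption are omitted.

```python
def speaker_predictor(prompt_emb):
    """Stand-in for stage 1: maps a prompt embedding to a speaker embedding."""
    # Hypothetical toy rule (scale by 0.5); the real stage is a learned model.
    return [0.5 * x for x in prompt_emb]

def style_predictor(prompt_emb, speaker_emb):
    """Stand-in for stage 2: conditioned on the prompt and the stage 1 output."""
    # Hypothetical toy rule: a small prompt-dependent offset added to the
    # speaker embedding, so the style embedding stays near its speaker cluster.
    offset = [0.1 * x for x in prompt_emb]
    return [s + o for s, o in zip(speaker_emb, offset)]

def predict_style(prompt_emb):
    """Run both stages in order and return (speaker_emb, style_emb)."""
    speaker_emb = speaker_predictor(prompt_emb)            # stage 1: coarse, timbre level
    style_emb = style_predictor(prompt_emb, speaker_emb)   # stage 2: fine, style level
    return speaker_emb, style_emb
```

The design point the sketch tries to capture is the ordering: the fine-grained prediction never sees the prompt alone, but always together with the coarse speaker-level result, which mirrors the hierarchical clustering reported in Figure 1.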