Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis

Yuxuan Wang, Daisy Stanton, Yu Zhang, RJ Skerry-Ryan, Eric Battenberg, Joel Shor, Ying Xiao, Fei Ren, Ye Jia, Rif A. Saurous

TL;DR

This paper introduces Global Style Tokens (GSTs), an unsupervised style modeling mechanism trained jointly with Tacotron to capture, control, and transfer speaking style. A reference encoder summarizes a reference audio signal, and that summary attends over a bank of style tokens to produce a style embedding that conditions the Tacotron text encoder states. This enables token-based style control, style scaling, and both parallel and non-parallel style transfer. The authors show that the learned tokens are interpretable, that the model is robust to noisy found data, and that, when trained on such data, the tokens learn to factorize noise and speaker identity. These results suggest GSTs offer a scalable approach to expressive, long-form TTS built on unsupervised prosody modeling, and that the approach may generalize to other domains requiring interpretable, controllable latent factors.

Abstract

In this work, we propose "global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system. The embeddings are trained with no explicit labels, yet learn to model a large range of acoustic expressiveness. GSTs lead to a rich set of significant results. The soft interpretable "labels" they generate can be used to control synthesis in novel ways, such as varying speed and speaking style - independently of the text content. They can also be used for style transfer, replicating the speaking style of a single audio clip across an entire long-form text corpus. When trained on noisy, unlabeled found data, GSTs learn to factorize noise and speaker identity, providing a path towards highly scalable but robust speech synthesis.

Paper Structure

This paper contains 28 sections, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Model diagram. During training, the log-mel spectrogram of the training target is fed to the reference encoder followed by a style token layer. The resulting style embedding is used to condition the Tacotron text encoder states. During inference, we can feed an arbitrary reference signal to synthesize text with its speaking style. Alternatively, we can remove the reference encoder and directly control synthesis using the learned interpretable tokens.
  • Figure 2: F0 and C0 (log scale) of two different sentences, synthesized using three tokens. Independent of the text content, the same token exhibits the same F0/C0 trend relative to the other tokens.
  • Figure 3: Effect of token scaling. From left to right, we scale the two tokens by -0.3, 0.1, 0.3, 0.5, respectively. Note that the model seems to exhibit the reverse effect (e.g. fast to slow or animated to calm) with a negative scale, which is never seen during training.
  • Figure 4: Log-mel spectrograms for parallel style transfer.
  • Figure 5: Robustness in non-parallel style transfer. Left to right: attention alignments obtained from feeding three references whose text lengths are 10, 96, 321 characters, respectively. The target text length is 258 characters.
  • ...and 3 more figures
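
The style token layer described in Figure 1 and the token scaling in Figure 3 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes single-head scaled dot-product attention, a reference embedding with the same dimensionality as the tokens, randomly initialized values in place of trained ones, and conditioning by adding the style embedding to every text encoder state. The names `style_embedding`, `token_control`, and `condition`, and all sizes, are hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)
    exps = np.exp(shifted)
    return exps / exps.sum()


def style_embedding(ref_embedding, tokens):
    """Use the reference embedding as a query over the token bank.

    Returns the attention-weighted sum of tokens and the weights.
    Assumption: single-head scaled dot-product attention.
    """
    scores = tokens @ ref_embedding / np.sqrt(tokens.shape[1])
    weights = softmax(scores)
    return weights @ tokens, weights


def token_control(tokens, index, scale):
    """Bypass the reference encoder: pick one token and scale it."""
    return scale * tokens[index]


def condition(encoder_states, style):
    """Assumption: broadcast-add the style embedding to each encoder state."""
    return encoder_states + style[None, :]


rng = np.random.default_rng(0)
num_tokens, dim = 10, 8  # hypothetical sizes, chosen for illustration

# Stand-ins for trained parameters and encoder outputs.
tokens = np.tanh(rng.normal(size=(num_tokens, dim)))
ref_embedding = rng.normal(size=dim)
encoder_states = rng.normal(size=(5, dim))  # 5 text positions

# Reference-driven style, as in style transfer.
style, weights = style_embedding(ref_embedding, tokens)
conditioned = condition(encoder_states, style)

# Token-driven style with the scales listed in the Figure 3 caption.
scaled_styles = [token_control(tokens, 2, s) for s in (-0.3, 0.1, 0.3, 0.5)]
```

Because the attention weights sum to one, a reference-driven style embedding is a weighted combination of the tokens, whereas direct control with a negative scale moves outside that range, consistent with the Figure 3 note that negative scales are never seen during training.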