Expressive Speech Synthesis via Modeling Expressions with Variational Autoencoder

Kei Akuzawa, Yusuke Iwasawa, Yutaka Matsuo

TL;DR

The paper tackles unsupervised expressive speech synthesis for autoregressive models by introducing VAE-Loop, which integrates a conditional variational autoencoder with VoiceLoop to capture global speech characteristics as a latent variable. The model enables unsupervised control over expressions and improves speech quality, demonstrated on VCTK and Blizzard2012 through MOS and objective test errors. The key innovations are conditioning the autoregressive generator on a global latent z and applying KL cost annealing to encourage latent usage. Results show the latent space can modulate speaker traits and prosody, and F0 trajectories illustrate controllable expressiveness.
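KL cost annealing, mentioned above, can be illustrated with a minimal sketch. The summary does not state the schedule shape or warm-up length, so the linear ramp, the function names (`kl_weight`, `gaussian_kl`, `annealed_loss`), and the `warmup_steps` parameter below are all assumptions made for illustration, not the authors' implementation. The idea is that the KL term is weighted by a factor that grows from 0 to 1, so early training is dominated by reconstruction and the model is less likely to ignore the latent z.

```python
import math


def kl_weight(step, warmup_steps):
    """Hypothetical linear schedule: 0 at step 0, 1 from warmup_steps onward."""
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, max(0.0, step / warmup_steps))


def gaussian_kl(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over latent dimensions."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def annealed_loss(recon_loss, mu, log_var, step, warmup_steps):
    """Reconstruction loss plus the KL term scaled by the annealing weight."""
    return recon_loss + kl_weight(step, warmup_steps) * gaussian_kl(mu, log_var)
```

With this schedule, `annealed_loss` equals the reconstruction loss alone at step 0 and the usual negative ELBO once `step >= warmup_steps`.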

Abstract

Recent advances in neural autoregressive models have improved the performance of speech synthesis (SS). However, as they lack the ability to model global characteristics of speech (such as speaker individualities or speaking styles), particularly when these characteristics have not been labeled, making neural autoregressive SS systems more expressive is still an open issue. In this paper, we propose to combine VoiceLoop, an autoregressive SS model, with a Variational Autoencoder (VAE). This approach, unlike traditional autoregressive SS systems, uses a VAE to model the global characteristics explicitly, enabling the expressiveness of the synthesized speech to be controlled in an unsupervised manner. Experiments using the VCTK and Blizzard2012 datasets show that the VAE helps VoiceLoop to generate higher-quality speech and to control the expressions in its synthesized speech by incorporating global characteristics into the speech-generating process.

Paper Structure

This paper contains 17 sections, 10 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Speech generating process of VAE-Loop
  • Figure 2: F0 trajectories for two utterances generated by VAE-Loop, trained on VCTK. Here, $\bm{z}_1$ and $\bm{z}_2$ correspond to high-pitched (perceived as female) and low-pitched (perceived as male) voices, respectively. Averaged F0 trajectories, generated by interpolating between $\bm{z}_1$ and $\bm{z}_2$, are also shown.
  • Figure 3: F0 trajectories for two utterances generated by VAE-Loop, trained on Blizzard2012. Here, $\bm{z}_1$ and $\bm{z}_2$ correspond to voices with large (perceived as dramatic) and small (perceived as calm) pitch fluctuations, respectively.
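
The Figure 2 caption refers to interpolating between $\bm{z}_1$ and $\bm{z}_2$. The exact interpolation scheme is not given here, so the following is only a sketch that assumes plain linear interpolation; `interpolate_latents` and its arguments are hypothetical names, and the latents are represented as ordinary Python lists of floats rather than model tensors.

```python
def interpolate_latents(z1, z2, num_steps):
    """Return num_steps latent vectors evenly spaced on the line from z1 to z2."""
    if len(z1) != len(z2):
        raise ValueError("z1 and z2 must have the same dimensionality")
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2 to include both endpoints")
    path = []
    for i in range(num_steps):
        alpha = i / (num_steps - 1)  # 0.0 at z1, 1.0 at z2
        path.append([(1.0 - alpha) * a + alpha * b for a, b in zip(z1, z2)])
    return path
```

Under this assumption, calling the function with `num_steps=3` gives the two endpoints plus their midpoint; each vector on the path would then be passed to the decoder as the global latent to synthesize one utterance.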