Prosody Transfer in Neural Text to Speech Using Global Pitch and Loudness Features

Siddharth Gururani, Kilol Gupta, Dhaval Shah, Zahra Shakeri, Jervis Pinto

TL;DR

This work tackles expressive TTS by enabling prosody transfer from reference speech to synthesized speech using a compact, low-dimensional reference encoder. The authors introduce GS-TC2, which conditions a vanilla Tacotron2 on $7$ global prosody features derived from the reference's pitch ($\log F_0$) and loudness (RMS) contours and mapped to $512$ dimensions. Evaluations include MOS, side-by-side prosody transfer tests, and novel objective metrics (cosine and DTW distances), which show that GS-TC2 matches the reference prosody more closely than vanilla Tacotron2 with only a modest decrease in naturalness. The approach is resource-efficient and produces natural, expressive speech, with potential for easy reference-based control and future extensions to richer prosodic cues.

Abstract

This paper presents a simple yet effective method to achieve prosody transfer from a reference speech signal to synthesized speech. The main idea is to incorporate well-known acoustic correlates of prosody, such as the pitch and loudness contours of the reference speech, into a modern neural text-to-speech (TTS) synthesizer such as Tacotron2 (TC2). More specifically, a small set of acoustic features is extracted from the reference audio and then used to condition a TC2 synthesizer. The trained model is evaluated using subjective listening tests, and a novel objective evaluation of prosody transfer is proposed. Listening tests show that the synthesized speech is rated as highly natural and that prosody is successfully transferred from the reference speech signal to the synthesized signal.
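To make the idea of "a small set of acoustic features" concrete, here is a minimal sketch of how global summary statistics could be computed from per-frame pitch and loudness contours. The exact seven statistics used by the authors are not listed in this summary, so the selection below (four for log-F0, three for RMS) is an illustrative assumption, as are the function name `global_prosody_features` and the convention that unvoiced frames carry an F0 of zero.

```python
import math
import statistics


def global_prosody_features(f0_hz, rms):
    """Summarise per-frame F0 (Hz) and RMS contours as 7 global statistics.

    ASSUMPTION: the choice of statistics is hypothetical; the summary only
    states that 7 global features come from the log-F0 and RMS contours.
    ASSUMPTION: unvoiced frames are marked with f0 <= 0 and are skipped.
    """
    log_f0 = [math.log(f) for f in f0_hz if f > 0]
    if not log_f0 or not rms:
        raise ValueError("need at least one voiced frame and one RMS frame")
    return [
        statistics.mean(log_f0),       # average pitch level
        statistics.pvariance(log_f0),  # pitch variability
        min(log_f0),                   # lowest pitch
        max(log_f0),                   # highest pitch
        statistics.mean(rms),          # average loudness
        statistics.pvariance(rms),     # loudness variability
        max(rms),                      # loudness peak
    ]


# Toy contours; a real pipeline would obtain these from a pitch tracker
# and a frame-wise energy computation.
features = global_prosody_features([0.0, 100.0, 200.0, 0.0], [0.1, 0.3])
```

Because every feature is a whole-utterance statistic, the vector has the same length regardless of the duration or transcript of the reference, which is what allows a reference with a different transcript to be used.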

Paper Structure

This paper contains 10 sections, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: Model architecture for prosody transfer. The reference encoder is a linear transformation of summary-statistic features, and its output is summed element-wise with each output of the text encoder. During training, prosody features are obtained from the ground-truth audio; during inference, they are extracted from the reference speech.
  • Figure 2: Example depicting logF0 and RMS contours for two waveforms synthesized from the same transcript using TC2 (green) and GS-TC2 (blue), compared to a ground-truth (GT) reference waveform with a different transcript (red). The logF0 and RMS statistics of the waveform generated by GS-TC2 are closer to those of the reference waveform.
  • Figure 3: Scatter plots showing the relative distance of pitch and loudness features of randomly chosen samples synthesized by GS-TC2 and TC2 from a reference utterance. The 7-dimensional features are projected onto 2 dimensions using t-SNE. In this trial, the GS-TC2 sample lies closer to the reference than the TC2 sample in 14 of 16 cases.
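The conditioning described for Figure 1 and the feature-space comparison behind Figure 3 can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: the helper names, the random weight initialisation, and the toy inputs are hypothetical, and the 512-dimensional projection is assumed to match the text encoder's output width. The cosine distance is one of the objective metrics named in the TL;DR; the DTW variant is omitted here.

```python
import math
import random


def make_linear(in_dim, out_dim, seed=0):
    """Create weights for a linear layer (ASSUMPTION: random init, zero bias)."""
    rng = random.Random(seed)
    weight = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def reference_encoder(features, weight, bias):
    """Linearly project the global prosody features to the embedding size."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weight, bias)]


def condition(text_encoder_outputs, prosody_embedding):
    """Add the same prosody embedding element-wise to every encoder step."""
    return [[h + p for h, p in zip(step, prosody_embedding)]
            for step in text_encoder_outputs]


def cosine_distance(a, b):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy example: 7 prosody features, 3 encoder steps of width 512.
weight, bias = make_linear(in_dim=7, out_dim=512)
prosody = [5.0, 0.1, 4.5, 5.6, 0.2, 0.01, 0.4]  # hypothetical feature values
embedding = reference_encoder(prosody, weight, bias)
encoder_out = [[0.0] * 512 for _ in range(3)]
conditioned = condition(encoder_out, embedding)
```

Since the embedding is a single vector broadcast over all encoder steps, it shifts the whole utterance toward the reference's global pitch and loudness profile rather than imposing a frame-by-frame contour. For evaluation, `cosine_distance` would be applied to the 7-dimensional feature vectors of a synthesized utterance and its reference, with a smaller value indicating closer prosody.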