DrawSpeech: Expressive Speech Synthesis Using Prosodic Sketches as Control Conditions

Weidong Chen; Shan Yang; Guangzhi Li; Xixin Wu

TL;DR

DrawSpeech tackles fine-grained prosody control in TTS by introducing prosody sketches as an intuitive conditioning signal. Its architecture comprises a Sketch Extractor, a Sketch-to-Contour Predictor, and a Latent Diffusion Model; generation is conditioned on sketches, predicted contours, and text, and operates on a VAE-based mel-spectrogram representation that a neural vocoder converts to a waveform. In experiments on LJSpeech, DrawSpeech achieves higher MOS and sketch-alignment scores than the baselines and demonstrates precise, word-level prosody control, including when sketches are derived from reference speech. The approach gives users a simple way to specify prosodic patterns without having to find a reference recording that matches exactly, and it suggests future extensions to additional prosody controls such as jitter and shimmer.

Abstract

Controlling text-to-speech (TTS) systems to synthesize speech with the prosodic characteristics expected by users has attracted much attention. To achieve controllability, current studies focus on two main directions: (1) using reference speech as a prosody prompt to guide speech synthesis, and (2) using natural language descriptions to control the generation process. However, finding reference speech that contains exactly the prosody that users want to synthesize takes a lot of effort, and description-based guidance can only determine the overall prosody, which makes fine-grained prosody control over the synthesized speech difficult. In this paper, we propose DrawSpeech, a sketch-conditioned diffusion model capable of generating speech based on any prosody sketches drawn by users. Specifically, the prosody sketches are fed to DrawSpeech to provide a rough indication of the expected prosody trends. DrawSpeech then recovers the detailed pitch and energy contours from the coarse sketches and synthesizes the desired speech. Experimental results show that DrawSpeech can generate speech with a wide variety of prosody and can precisely control fine-grained prosody in a user-friendly manner. Our implementation and audio samples are publicly available.
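
The abstract describes a sketch as a rough indication of the prosody trend, from which the detailed contour is later recovered. The snippet below is only an illustrative sketch of that coarse-versus-detailed relationship, not the paper's Sketch Extractor, whose exact procedure is not stated in this summary. It assumes, hypothetically, that a coarse sketch can be approximated by averaging a frame-level contour within segments (for example, word spans) and then min-max normalizing the result. The function names `coarse_sketch` and `min_max_normalize`, the segment boundaries, and the pitch values are all invented for illustration.

```python
from statistics import mean


def coarse_sketch(contour, segments):
    """Replace each segment of a frame-level contour with its mean value.

    contour  : list of per-frame pitch (or energy) values
    segments : list of (start, end) frame indices, end exclusive
               (hypothetical word boundaries; an assumption of this example)
    """
    sketch = list(contour)
    for start, end in segments:
        if not 0 <= start < end <= len(contour):
            raise ValueError(f"bad segment ({start}, {end})")
        level = mean(contour[start:end])
        sketch[start:end] = [level] * (end - start)
    return sketch


def min_max_normalize(values):
    """Scale values to [0, 1] so only the relative trend remains."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A flat curve carries no trend; map it to all zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up frame-level pitch values in Hz, split into three segments.
contour = [100.0, 110.0, 120.0, 200.0, 220.0, 150.0, 130.0, 110.0]
segments = [(0, 3), (3, 5), (5, 8)]

sketch = coarse_sketch(contour, segments)
normalized = min_max_normalize(sketch)
print(sketch)
print(normalized)
```

The normalized curve keeps only where the pitch is relatively high or low, which is the kind of coarse trend a user could plausibly draw by hand. Recovering the frame-level detail from such a curve is the role the abstract assigns to the model itself.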

Paper Structure

This paper contains 16 sections, 7 equations, 3 figures, and 3 tables.

Introduction
Methodology
Sketch Extractor
Sketch-to-Contour Predictor
Latent Diffusion Model
Experiments
Experimental Setup
Dataset
Implementation Details
Baselines
Evaluation Metrics
Experimental Results
Applying sketches to other models
Impact on sound quality
Precise prosody control
...and 1 more sections

Figures (3)

Figure 1: Overview of the proposed DrawSpeech architecture. Paired speech and text data are used for training. User-supplied text and a drawn pitch or energy sketch are used as inputs during inference.
Figure 2: Illustrations of (a) a pitch contour and (b) a pitch sketch. (c) The contour and the sketch plotted in the same frame for direct comparison.
Figure 3: Drawing different sketches to achieve precise prosody control. The solid line represents the drawn pitch sketch. The points indicate the pitch of each frame in the synthesized speech.
