DrawSpeech: Expressive Speech Synthesis Using Prosodic Sketches as Control Conditions
Weidong Chen, Shan Yang, Guangzhi Li, Xixin Wu
TL;DR
DrawSpeech tackles fine-grained prosody control in TTS by introducing prosody sketches as an intuitive conditioning signal. Its architecture comprises a Sketch Extractor, a Sketch-to-Contour Predictor, and a Latent Diffusion Model; generation is conditioned on the sketches, the predicted contours, and the text, and relies on a VAE-based mel-spectrogram representation and a neural vocoder. In experiments on LJSpeech, DrawSpeech achieves higher MOS and sketch-alignment scores than strong baselines and demonstrates precise, word-level prosody control, including when sketches are derived from reference speech. The approach gives users a simple way to specify prosodic patterns without finding an exactly matching reference recording, and it suggests future extensions to further prosody controls, such as jitter and shimmer, for richer expressiveness.
Abstract
Controlling text-to-speech (TTS) systems to synthesize speech with the prosodic characteristics expected by users has attracted much attention. To achieve controllability, current studies focus on two main directions: (1) using reference speech as a prosody prompt to guide speech synthesis, and (2) using natural language descriptions to control the generation process. However, finding reference speech that contains exactly the prosody a user wants to synthesize takes considerable effort, and description-based guidance can only determine the overall prosody, so it struggles to achieve fine-grained prosody control over the synthesized speech. In this paper, we propose DrawSpeech, a sketch-conditioned diffusion model capable of generating speech based on any prosody sketch drawn by users. Specifically, the prosody sketches are fed to DrawSpeech to provide a rough indication of the expected prosody trends. DrawSpeech then recovers the detailed pitch and energy contours from the coarse sketches and synthesizes the desired speech. Experimental results show that DrawSpeech can generate speech with a wide variety of prosody and can precisely control fine-grained prosody in a user-friendly manner. Our implementation and audio samples are publicly available.
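To make the distinction between a detailed contour and a coarse sketch concrete, the following is a minimal illustration, not the method used in DrawSpeech. It assumes, purely for illustration, that a sketch can be approximated by replacing each word's frames with that word's mean pitch. The function name `coarse_sketch`, the averaging rule, the word boundaries, and the toy pitch values are all hypothetical.

```python
from statistics import mean


def coarse_sketch(contour, word_bounds):
    """Flatten a per-frame contour to one level per word.

    contour: per-frame values, e.g. pitch in Hz.
    word_bounds: (start, end) frame indices per word, end exclusive.
    The per-word averaging rule is an illustrative assumption only.
    """
    sketch = list(contour)
    for start, end in word_bounds:
        level = mean(contour[start:end])
        sketch[start:end] = [level] * (end - start)
    return sketch


# Toy pitch contour (Hz) covering three hypothetical words.
pitch = [100.0, 110.0, 120.0, 200.0, 220.0, 150.0, 130.0, 110.0]
bounds = [(0, 3), (3, 5), (5, 8)]
sketch = coarse_sketch(pitch, bounds)
print(sketch)
```

The resulting sequence keeps only the rough trend (low, high, mid) and discards the frame-level detail. That detail is the information a sketch-conditioned model would have to recover when predicting the full pitch and energy contours.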
