Controllable Prosody Generation With Partial Inputs
Dan Andrei Iliescu, Devang Savita Ram Mohan, Tian Huey Teh, Zack Hodari
TL;DR
This work tackles fine-grained, human-in-the-loop (HitL) control of prosody in text-to-speech by introducing MICVAE, a Multiple-Instance Conditional Variational Autoencoder that encodes a partial set of prosodic cues (control points) and generates a complete sequence of prosodic acoustic features (PAFs). The core innovation is a self-attention-based multiple-instance encoder that treats control points as an unordered bag and produces a representation that conditions a Gaussian latent variable, from which the full PAFs are decoded. The model is robust to varying missingness patterns and remains efficient with as few as four control points. Empirical results on a Latin American Spanish dataset show MICVAE outperforms baselines in objective RMSE and subjective soundness, with iterative refinement further improving alignment to target prosody. The work presents a reproducible HitL evaluation framework and establishes efficiency, robustness, and faithfulness as key criteria for controllable generative models, laying groundwork for practical interactive TTS systems.
Abstract
We address the problem of human-in-the-loop control for generating prosody in the context of text-to-speech synthesis. Controlling prosody is challenging because existing generative models lack an efficient interface through which users can modify the output quickly and precisely. To solve this, we introduce a novel framework whereby the user provides partial inputs and the generative model generates the missing features. We propose a model that is specifically designed to encode partial prosodic features and output complete audio. We show empirically that our model displays two essential qualities of a human-in-the-loop control mechanism: efficiency and robustness. With even a very small number of input values (~4), our model enables users to improve the quality of the output significantly in terms of listener preference (4:1).
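
To make the idea of encoding partial inputs as an unordered bag more concrete, the following is a minimal sketch in plain Python. It is not the MICVAE implementation: the `(time_index, value)` representation of a control point, the sinusoidal `embed` features, the identity attention projections, and the mean pooling are all assumptions chosen for illustration, and the Gaussian latent variable and decoder are omitted. It only shows that self-attention followed by pooling yields a fixed-size encoding that does not depend on the order or number of control points.

```python
import math


def embed(point):
    """Map a hypothetical (time_index, value) control point to a feature vector.

    The sinusoidal position features are an illustrative assumption,
    not the embedding used in the paper.
    """
    t, v = point
    return [v, math.sin(t / 10.0), math.cos(t / 10.0), 1.0]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(vectors):
    """Single-head self-attention with identity projections (no learned weights)."""
    dim = len(vectors[0])
    scale = math.sqrt(dim)
    attended = []
    for query in vectors:
        weights = softmax([dot(query, key) / scale for key in vectors])
        attended.append(
            [sum(w * vec[i] for w, vec in zip(weights, vectors)) for i in range(dim)]
        )
    return attended


def encode_bag(control_points):
    """Encode an unordered bag of control points into one fixed-size vector."""
    attended = self_attention([embed(p) for p in control_points])
    n = len(attended)
    dim = len(attended[0])
    # Mean pooling removes any dependence on the order of the bag.
    return [sum(vec[i] for vec in attended) / n for i in range(dim)]


# Four hypothetical control points, given in two different orders.
points = [(0, 0.2), (5, -0.1), (12, 0.7), (20, 0.3)]
encoding = encode_bag(points)
encoding_reordered = encode_bag(list(reversed(points)))
```

In this sketch, `encoding` and `encoding_reordered` agree up to floating-point rounding, and `encode_bag` accepts bags of any non-zero size. In a full model, a vector of this kind would condition the latent variable from which the complete feature sequence is decoded.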
