Controllable Prosody Generation With Partial Inputs

Dan Andrei Iliescu, Devang Savita Ram Mohan, Tian Huey Teh, Zack Hodari

TL;DR

This work tackles fine-grained, human-in-the-loop (HitL) control of prosody in text-to-speech by introducing MICVAE, a Multiple-Instance Conditional Variational Autoencoder that encodes a partial set of prosodic cues (control points) and generates a complete sequence of prosodic acoustic features (PAFs). The core innovation is a self-attention-based multiple-instance encoder that treats the control points as an unordered bag and aggregates them into a fixed-length representation, which conditions a Gaussian latent variable used to decode the full PAFs. The model is robust to varying missingness patterns and efficient with as few as four control points. Empirical results on a Latin American Spanish dataset show that MICVAE outperforms baselines in objective RMSE and subjective soundness, and that iterative refinement further improves alignment to the target prosody. The work presents a reproducible HitL evaluation framework and establishes efficiency, robustness, and faithfulness as key criteria for controllable generative models, laying groundwork for practical interactive TTS systems.

Abstract

We address the problem of human-in-the-loop control for generating prosody in the context of text-to-speech synthesis. Controlling prosody is challenging because existing generative models lack an efficient interface through which users can modify the output quickly and precisely. To solve this, we introduce a novel framework whereby the user provides partial inputs and the generative model generates the missing features. We propose a model that is specifically designed to encode partial prosodic features and output complete audio. We show empirically that our model displays two essential qualities of a human-in-the-loop control mechanism: efficiency and robustness. With even a very small number of input values (~4), our model enables users to improve the quality of the output significantly in terms of listener preference (4:1).

Paper Structure (9 sections, 6 figures)

Figures (6)

  • Figure 1: Our model, MICVAE, encodes partial information and decodes a complete output. The inputs and outputs are prosodic acoustic features (PAFs), with 3 values (F0, energy and duration) for each phoneme in the sentence.
  • Figure 2: Our novel "multiple-instance" encoder. The control points of the partial input are treated as an unordered bag of features. They are aggregated into a fixed-length vector by a self-attention mechanism.
  • Figure 3: Our model produces plausible prosody, whereas CrudeControl produces inconsistent prosody. The control points are shown with green vertical bars.
  • Figure 4: Our model controls prosody efficiently. Its generated PAFs are closer to the ground-truth rendition than the manually modified output of the AFP, called CrudeControl.
  • Figure 5: Our model is robust to changes in the missingness pattern, whereas Masked CVAE only performs well when tested on the same missingness pattern as the one on which it was trained.
  • ...and 1 more figure
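
The multiple-instance encoder described in Figure 2 can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the class `BagEncoder`, the helper `featurize`, the 5-dimensional encoding of a control point (normalized phoneme position, one-hot feature type, value), the random untrained weights, and the mean pooling step are all assumptions made for illustration. The sketch only shows the property the caption names: self-attention without positional encoding, followed by pooling, maps an unordered bag of any size to a fixed-length vector.

```python
import numpy as np

FEATURES = ("f0", "energy", "duration")  # the three PAF types named in Figure 1


def featurize(points, n_phonemes):
    """Encode (phoneme_idx, feature_name, value) tuples as rows (assumed layout)."""
    rows = []
    for phoneme_idx, feat, value in points:
        one_hot = [0.0, 0.0, 0.0]
        one_hot[FEATURES.index(feat)] = 1.0
        position = phoneme_idx / max(n_phonemes - 1, 1)  # normalize to [0, 1]
        rows.append([position, *one_hot, value])
    return np.asarray(rows, dtype=np.float64)


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class BagEncoder:
    """Hypothetical attention-pooling encoder over a bag of control points."""

    def __init__(self, in_dim, hid_dim, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(in_dim)
        # Random, untrained projections; a real model would learn these.
        self.w_q = rng.normal(0.0, scale, (in_dim, hid_dim))
        self.w_k = rng.normal(0.0, scale, (in_dim, hid_dim))
        self.w_v = rng.normal(0.0, scale, (in_dim, hid_dim))
        self.hid_dim = hid_dim

    def __call__(self, bag):
        # bag has shape (n_points, in_dim); n_points may vary between calls.
        q, k, v = bag @ self.w_q, bag @ self.w_k, bag @ self.w_v
        attn = softmax(q @ k.T / np.sqrt(self.hid_dim), axis=-1)  # (n, n)
        mixed = attn @ v  # each control point attends to all the others
        return mixed.mean(axis=0)  # pool to a fixed-length vector (hid_dim,)


# Four control points for a hypothetical 12-phoneme sentence.
points = [(0, "f0", 0.3), (4, "energy", -1.2), (7, "duration", 0.8), (9, "f0", -0.5)]
bag = featurize(points, n_phonemes=12)
encoder = BagEncoder(in_dim=5, hid_dim=8)
summary = encoder(bag)
```

Because no positional encoding is applied to the rows of `bag`, reordering the control points leaves `summary` unchanged up to floating-point error, and a bag with fewer points still yields a vector of length `hid_dim`. In a full model, such a vector would be one of the inputs that condition the latent variable and the decoder.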