Fine-Grained and Interpretable Neural Speech Editing

Max Morrison; Cameron Churchwell; Nathan Pruyne; Bryan Pardo

TL;DR

Entangled speech representations make fine-grained editing difficult. This paper introduces a transcript-free, disentangled, and interpretable representation with four time-aligned components: sparse phonetic posteriorgrams (SPPGs), Viterbi-decoded pitch, entropy-based periodicity, and multi-band $A$-weighted loudness. It pairs this representation with data-augmentation strategies that allow spectral balance and volume to be controlled separately. A HiFi-GAN vocoder is trained on the representation, aided by a joint speaker embedding and a complex multi-band discriminator, enabling fast, high-fidelity edits of pitch, duration, volume, timbre, pronunciation, speaker identity, and spectral balance. The approach achieves reconstruction quality comparable to Mel spectrograms, demonstrates accurate and perceptually improved prosody control, and effectively disentangles spectral balance from volume. These properties offer practical benefits for post-production and may enable future capabilities such as one-shot disentangled voice conversion.
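As a rough illustration of two of the components named above, the following Python sketch shows one plausible way to sparsify a single phonetic posteriorgram frame and to derive a periodicity value from the entropy of a pitch posterior. The function names, the thresholding rule, and the 0.05 threshold are assumptions made for illustration; this summary does not state how the paper implements either step.

```python
import math


def sparsify_ppg(frame, threshold=0.05):
    """Zero out phoneme probabilities below `threshold`, then renormalize.

    `frame` is one time step of a phonetic posteriorgram: a list of
    probabilities over phoneme categories. The threshold value is a
    hypothetical choice, not a figure taken from the paper.
    """
    kept = [p if p >= threshold else 0.0 for p in frame]
    total = sum(kept)
    if total == 0.0:
        # Nothing survived the threshold: fall back to a one-hot argmax.
        best = max(range(len(frame)), key=frame.__getitem__)
        return [1.0 if i == best else 0.0 for i in range(len(frame))]
    return [p / total for p in kept]


def entropy_periodicity(pitch_posterior):
    """Return 1 minus the normalized entropy of a pitch posterior.

    `pitch_posterior` is a categorical distribution over pitch bins
    (at least two bins). A peaked distribution gives a value near 1
    (confidently pitched); a flat one gives a value near 0.
    """
    num_bins = len(pitch_posterior)
    entropy = -sum(p * math.log(p) for p in pitch_posterior if p > 0.0)
    return 1.0 - entropy / math.log(num_bins)
```

Under this definition a uniform posterior maps to a periodicity of 0 and a one-hot posterior maps to 1, which matches the intuition that unvoiced or noisy frames carry no clear pitch.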

Abstract

Fine-grained editing of speech attributes, such as prosody (i.e., the pitch, loudness, and phoneme durations), pronunciation, speaker identity, and formants, is useful for fine-tuning and fixing imperfections in human and AI-generated speech recordings for creation of podcasts, film dialogue, and video game dialogue. Existing speech synthesis systems use representations that entangle two or more of these attributes, prohibiting their use in fine-grained, disentangled editing. In this paper, we demonstrate the first disentangled and interpretable representation of speech with comparable subjective and objective vocoding reconstruction accuracy to Mel spectrograms. Our interpretable representation, combined with our proposed data augmentation method, enables training an existing neural vocoder to perform fast, accurate, and high-quality editing of pitch, duration, volume, timbral correlates of volume, pronunciation, speaker identity, and spectral balance.
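To make the notion of volume as an interpretable feature more concrete, the sketch below applies the standard A-weighting curve (as defined in IEC 61672) to per-bin magnitudes in decibels and averages the result within frequency bands. The band layout, the default of eight bands, and both function names are assumptions for illustration only; the paper's actual loudness computation may differ.

```python
import math


def a_weighting_db(freq_hz):
    """Standard A-weighting gain in dB at `freq_hz` (about 0 dB at 1 kHz)."""
    f2 = freq_hz * freq_hz
    numerator = (12194.0 ** 2) * f2 * f2
    denominator = (
        (f2 + 20.6 ** 2)
        * math.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
        * (f2 + 12194.0 ** 2)
    )
    # Floor the ratio so the DC bin (0 Hz) does not produce log10(0).
    ratio = max(numerator / denominator, 1e-10)
    return 20.0 * math.log10(ratio) + 2.00


def multiband_loudness(freqs_hz, magnitudes_db, num_bands=8):
    """Average A-weighted magnitude (dB) in `num_bands` equal-width bands.

    `freqs_hz` and `magnitudes_db` describe one spectrogram frame.
    Bins are grouped in order; leftover bins that do not fill a whole
    band are dropped. The equal-width grouping is a hypothetical choice.
    """
    weighted = [m + a_weighting_db(f) for f, m in zip(freqs_hz, magnitudes_db)]
    size = len(weighted) // num_bands
    return [
        sum(weighted[b * size:(b + 1) * size]) / size
        for b in range(num_bands)
    ]
```

Keeping one loudness value per band, rather than a single scalar, is what would let an editor change overall volume and spectral balance independently: shifting every band by the same amount changes volume, while shifting bands by different amounts changes spectral balance.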

Paper Structure (16 sections, 1 figure, 1 table)

This paper contains 16 sections, 1 figure, 1 table.

Introduction
A disentangled, interpretable representation of speech
Sparse phonetic posteriorgrams (SPPGs)
Viterbi-decoded pitch
Entropy-based periodicity
Multi-band A-weighted loudness
Controlling spectral balance and timbre
Neural speech editing model
Data
Evaluation
Objective metrics
Crowdsourced subjective evaluation
Evaluation of speech reconstruction
Evaluation of disentangled prosody control
Evaluation of data augmentation
...and 1 more section

Figures (1)

Figure 1: Our proposed speech representation. The time-varying components of our interpretable, disentangled speech representation applied to a recording of Arnold Schwarzenegger saying "I'll be back" from the movie The Terminator.
