VoiceShop: A Unified Speech-to-Speech Framework for Identity-Preserving Zero-Shot Voice Editing

Philip Anastassiou; Zhenyu Tang; Kainan Peng; Dongya Jia; Jiaxin Li; Ming Tu; Yuping Wang; Yuxuan Wang; Mingbo Ma

TL;DR

VoiceShop is a unified, zero-shot voice editing (VE) framework that preserves speaker timbre while editing multiple attributes (age, gender, accent, speech style) in a single forward pass. It combines a diffusion-based backbone, conditioned on a global speaker embedding and time-varying content features, with modular editing modules: a continuous normalizing flow (CNF) for age and gender, and bottleneck-to-bottleneck (BN2BN) models for accent and speech style. These modules are trained separately from the backbone, so they can be plugged in at inference without fine-tuning. The approach demonstrates strong zero-shot performance on monolingual and cross-lingual tasks, with substantial attribute disentanglement and favorable subjective and objective metrics across voice conversion (VC), accent conversion, and speech style transfer. The work advances practical VE capabilities, particularly the ability to edit multiple attributes simultaneously without parallel data or target speakers, though the authors name data balance and language coverage as future directions. The framework holds potential for broad prosthetic and cross-cultural communication applications, while underscoring ethical considerations around misrepresentation and the importance of detection and responsible use.

Abstract

We present VoiceShop, a novel speech-to-speech framework that can modify multiple attributes of speech, such as age, gender, accent, and speech style, in a single forward pass while preserving the input speaker's timbre. Previous works have been constrained to specialized models that can only edit these attributes individually and suffer from the following pitfalls: the magnitude of the conversion effect is weak, there is no zero-shot capability for out-of-distribution speakers, or the synthesized outputs exhibit undesirable timbre leakage. Our work proposes solutions for each of these issues in a simple modular framework based on a conditional diffusion backbone model with optional normalizing flow-based and sequence-to-sequence speaker attribute-editing modules, whose components can be combined or removed during inference to meet a wide array of tasks without additional model fine-tuning. Audio samples are available at https://voiceshopai.github.io.
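The plug-and-play composition described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the class name `VoiceEditingPipeline`, its field names, and the toy arithmetic stand-ins are hypothetical and are not the authors' implementation. The sketch only shows the control flow, in which the optional editors for the speaker embedding (SE) and the bottleneck (BN) content features can be attached or left out without touching the backbone.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Waveform = List[float]
Embedding = List[float]          # global speaker embedding (SE)
Bottleneck = List[List[float]]   # per-frame content (BN) features
Mel = List[List[float]]          # stand-in for mel-spectrogram frames


@dataclass
class VoiceEditingPipeline:
    """Hypothetical analysis -> optional editing -> synthesis pipeline."""
    speaker_encoder: Callable[[Waveform], Embedding]
    content_encoder: Callable[[Waveform], Bottleneck]
    backbone: Callable[[Embedding, Bottleneck], Mel]
    vocoder: Callable[[Mel], Waveform]
    se_editor: Optional[Callable[[Embedding], Embedding]] = None   # e.g. age/gender
    bn_editor: Optional[Callable[[Bottleneck], Bottleneck]] = None  # e.g. accent/style

    def __call__(self, waveform: Waveform) -> Waveform:
        # Analysis: split the input into global identity and local content.
        se = self.speaker_encoder(waveform)
        bn = self.content_encoder(waveform)
        # Optional editing: each module is applied only if it was plugged in.
        if self.se_editor is not None:
            se = self.se_editor(se)
        if self.bn_editor is not None:
            bn = self.bn_editor(bn)
        # Synthesis: the backbone is conditioned on both, then vocoded.
        return self.vocoder(self.backbone(se, bn))


# Toy stand-ins (assumptions): "identity" is the mean, "content" is the residual.
def toy_speaker_encoder(x: Waveform) -> Embedding:
    return [sum(x) / len(x)]

def toy_content_encoder(x: Waveform) -> Bottleneck:
    mean = sum(x) / len(x)
    return [[v - mean] for v in x]

def toy_backbone(se: Embedding, bn: Bottleneck) -> Mel:
    return [[se[0] + frame[0]] for frame in bn]

def toy_vocoder(mel: Mel) -> Waveform:
    return [frame[0] for frame in mel]


def build(se_editor=None, bn_editor=None) -> VoiceEditingPipeline:
    return VoiceEditingPipeline(toy_speaker_encoder, toy_content_encoder,
                                toy_backbone, toy_vocoder, se_editor, bn_editor)


reconstruct = build()
edit_identity = build(se_editor=lambda se: [se[0] + 1.0])
edit_both = build(se_editor=lambda se: [se[0] + 1.0],
                  bn_editor=lambda bn: [[2.0 * f[0]] for f in bn])
```

With no editors attached, the sketch reduces to plain reconstruction of the input. Attaching one or both editors changes only the corresponding conditioning signal, which mirrors the idea that the editing modules are optional at inference time.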

Paper Structure (37 sections, 11 equations, 13 figures, 12 tables)

Introduction
Related Work
Zero-Shot Voice Conversion.
Accent and Speech Style Conversion.
Age and Gender Editing.
VoiceShop
Method Overview
Large-Scale Pre-Training
Conformer-based ASR Model
Conditional Diffusion Backbone Model for Zero-Shot Voice Conversion
Neural Mel-Spectrogram Vocoder
Task-Specific Voice Editing Modules
Attribute-Conditional Normalizing Flow for Age and Gender Editing
Attributes Dataset.
Bottleneck-to-Bottleneck (BN2BN) Modeling for Many-to-Many Accent and Speech Style Conversion
...and 22 more sections

Figures (13)

Figure 1: The architecture of VoiceShop: The overall pipeline follows an analysis-synthesis approach. During the analysis step, the ECAPA-TDNN speaker encoder and pre-trained ASR module respectively decompose input speech into speaker identity represented by a global speaker embedding (SE) and local content embeddings represented by a sequence of bottleneck (BN) features. During the synthesis step, both SE and BN features condition the diffusion backbone model to reconstruct mel-spectrograms of input speech, followed by a vocoder to acquire time-domain waveforms. Voice editing: Editing is an optional step in between the analysis and synthesis steps reserved for inference. As highlighted in the two dashed boxes, an attribute-conditional flow module is used to globally edit the speaker embedding (e.g., change age and gender), whereas a BN2BN module is used to edit content embeddings (e.g., change prosody or speech style). These voice editing modules are trained separately from the generative backbone module and are used in a modular plug-and-play manner.
Figure 2: Attribute conditional flow editing module: Starting with a speaker's voice sample, we use a pre-trained attribute predictor to obtain age and gender labels, which we denote as the original attribute $\mathbf{a}$, and use the speaker encoder jointly trained with the diffusion backbone to extract the speaker embedding $\mathbf{w}$. The three steps of editing during inference proceed as follows: 1. An ODE solver utilizes the pre-trained CNF model which is conditioned on $\mathbf{a}$ and $t$ to reverse integrate $\mathbf{w}$ from $t_1$ to $t_0$ into $\mathbf{z_0}$, which is the encoded latent in the prior space. 2. Modify any or all attributes of the original speaker to obtain the new attribute vector $\mathbf{a'}$. 3. Use the ODE solver again for forward integration from $t_0$ to $t_1$ using the CNF model with $\mathbf{z_0}$ conditioned on $\mathbf{a'}$. The output is a new speaker embedding $\mathbf{w'}$, which embeds the edited attributes. When using $\mathbf{w'}$ with our diffusion backbone model, the generated voice should retain the unedited attributes in the original input voice (i.e., editing gender should not affect age and vice versa).
Figure 3: Bottleneck-to-bottleneck (BN2BN) modeling: Our BN2BN design maps the time-varying content features of utterances from an arbitrary number of source accents to those of an arbitrary number of target accents in a single model using a multi-decoder architecture.
Figure 4: Training configuration of cross-lingual accent conversion (AC) using BN2BN modeling, featuring adversarial domain adaptation via gradient reversal to promote language-agnostic content representations in the learned latent space of the universal encoder. We use the same reference encoder proposed by Wang et al. (2018).
Figure 5: Visualizing accent transfer: By using the latent space of accent classifiers, we observe that input speech is clustered by source accent and that accent-converted speech predicted by our BN2BN models largely preserves these structures according to the target accents.
...and 8 more figures
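
The three inference steps in the Figure 2 caption (reverse integration under the original attributes, attribute modification, forward integration under the new attributes) can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' model: `toy_vector_field` and `ATTRIBUTE_PROJECTION` are hypothetical stand-ins for the learned CNF dynamics, the embedding has only four dimensions, and a hand-written fixed-step RK4 integrator replaces whatever ODE solver the paper uses.

```python
from typing import Callable, List, Sequence

Vector = List[float]
VectorField = Callable[[Sequence[float], Sequence[float], float], Vector]

# Hypothetical fixed map from (age, gender) to a 4-dim shift; a real CNF
# would learn its dynamics from data instead.
ATTRIBUTE_PROJECTION = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5], [0.2, 0.3]]


def toy_vector_field(z: Sequence[float], a: Sequence[float], t: float) -> Vector:
    """Stand-in for dz/dt = f(z, a, t): pull z toward an attribute-dependent shift."""
    shift = [sum(p * ai for p, ai in zip(row, a)) for row in ATTRIBUTE_PROJECTION]
    return [s - zi for s, zi in zip(shift, z)]


def integrate(f: VectorField, z: Sequence[float], a: Sequence[float],
              t_start: float, t_end: float, steps: int = 200) -> Vector:
    """Fixed-step RK4. If t_end < t_start the step is negative (reverse integration)."""
    h = (t_end - t_start) / steps
    z = list(z)
    t = t_start
    for _ in range(steps):
        k1 = f(z, a, t)
        k2 = f([zi + 0.5 * h * k for zi, k in zip(z, k1)], a, t + 0.5 * h)
        k3 = f([zi + 0.5 * h * k for zi, k in zip(z, k2)], a, t + 0.5 * h)
        k4 = f([zi + h * k for zi, k in zip(z, k3)], a, t + h)
        z = [zi + (h / 6.0) * (a1 + 2.0 * a2 + 2.0 * a3 + a4)
             for zi, a1, a2, a3, a4 in zip(z, k1, k2, k3, k4)]
        t += h
    return z


def edit_speaker_embedding(w: Sequence[float], a: Sequence[float],
                           a_new: Sequence[float],
                           f: VectorField = toy_vector_field,
                           t0: float = 0.0, t1: float = 1.0) -> Vector:
    # Step 1: reverse-integrate w from t1 to t0 under the original attributes a.
    z0 = integrate(f, w, a, t1, t0)
    # Step 2: the caller has already chosen the modified attribute vector a_new.
    # Step 3: forward-integrate z0 from t0 to t1 under a_new to obtain w'.
    return integrate(f, z0, a_new, t0, t1)


w = [0.3, -1.2, 0.8, 0.05]       # toy speaker embedding
a = [0.2, 1.0]                   # toy (age, gender) attributes
w_same = edit_speaker_embedding(w, a, a)            # no attribute change
w_older = edit_speaker_embedding(w, a, [0.7, 1.0])  # change age only
```

Passing the same attributes in both directions returns the original embedding up to numerical error, which is the invertibility property the procedure relies on. In this toy field, changing only the age entry leaves the coordinate driven purely by gender untouched; whether a trained model achieves the same disentanglement is an empirical question that the sketch does not address.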
