VoiceShop: A Unified Speech-to-Speech Framework for Identity-Preserving Zero-Shot Voice Editing
Philip Anastassiou, Zhenyu Tang, Kainan Peng, Dongya Jia, Jiaxin Li, Ming Tu, Yuping Wang, Yuxuan Wang, Mingbo Ma
TL;DR
VoiceShop is a unified, zero-shot voice editing (VE) framework that preserves speaker timbre while editing multiple attributes (age, gender, accent, style) in a single forward pass. It combines a diffusion-based backbone, conditioned on global speaker embeddings and time-varying content features, with separately trained editing modules: a normalizing flow-based module (CNF) for age and gender, and a sequence-to-sequence module (BN2BN) for accent and style. Because the modules are trained separately, they can be combined at inference without fine-tuning. The approach shows strong zero-shot performance on monolingual and cross-lingual tasks, with substantial attribute disentanglement and favorable subjective and objective metrics across voice conversion (VC), accent conversion, and speech style transfer. Its main practical advance is the ability to edit multiple attributes simultaneously without parallel data or target speakers; the authors identify data balance and language coverage as future directions. The framework has potential for broad prosthetic and cross-cultural communication applications, and the work underscores ethical considerations around misrepresentation and the importance of detection and responsible use.
Abstract
We present VoiceShop, a novel speech-to-speech framework that can modify multiple attributes of speech, such as age, gender, accent, and speech style, in a single forward pass while preserving the input speaker's timbre. Previous works have been constrained to specialized models that can only edit these attributes individually and suffer from the following pitfalls: the magnitude of the conversion effect is weak, there is no zero-shot capability for out-of-distribution speakers, or the synthesized outputs exhibit undesirable timbre leakage. Our work proposes solutions for each of these issues in a simple modular framework based on a conditional diffusion backbone model with optional normalizing flow-based and sequence-to-sequence speaker attribute-editing modules, whose components can be combined or removed during inference to meet a wide array of tasks without additional model finetuning. Audio samples are available at https://voiceshopai.github.io.
