Voice Attribute Editing with Text Prompt

Zhengyan Sheng, Yang Ai, Li-Juan Liu, Jia Pan, Zhen-Hua Ling

TL;DR

The paper defines the task of voice attribute editing with text prompts and introduces VoxEditor, which makes relative modifications to voice attributes while preserving content. It proposes a Residual Memory (ResMem) block, enhanced with a voice attribute degree prediction (VADP) block for descriptor alignment, to address the insufficiency and imprecision of text prompts, and it is backed by the VCTK-RVA dataset of relative attribute annotations. In objective and subjective evaluations, VoxEditor shows strong alignment with text prompts and preservation of source voice characteristics, including generalization to unseen speakers. The work enables controllable voice editing driven by natural language descriptors and provides resources and benchmarks to stimulate further research in fine-grained voice attribute control.

Abstract

Despite recent advancements in speech generation with text prompt providing control over speech style, voice attributes in synthesized speech remain elusive and challenging to control. This paper introduces a novel task: voice attribute editing with text prompt, with the goal of making relative modifications to voice attributes according to the actions described in the text prompt. To solve this task, VoxEditor, an end-to-end generative model, is proposed. In VoxEditor, to address the insufficiency of text prompt, a Residual Memory (ResMem) block is designed that efficiently maps voice attributes and their descriptors into a shared feature space. Additionally, the ResMem block is enhanced with a voice attribute degree prediction (VADP) block to align voice attributes with corresponding descriptors, addressing the imprecision of text prompt caused by non-quantitative descriptions of voice attributes. We also establish the open-source VCTK-RVA dataset, which leads the way in manual annotations detailing voice characteristic differences among different speakers. Extensive experiments demonstrate the effectiveness and generalizability of our proposed method in terms of both objective and subjective metrics. The dataset and audio samples are available on the website.

Paper Structure (34 sections, 19 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Illustration of voice attribute editing with text prompt.
  • Figure 2: The overall flowchart of our proposed VoxEditor. During the training process, two speech segments ($Speech^A$ and $Speech^B$) are used, along with voice attribute descriptor $x$. In the inference process, the model takes source speech and the text prompt as inputs to generate edited speech. Here $Mel$ denotes the Mel spectrograms, $Linear$ denotes the linear spectrograms, $\bm{s}_m$ denotes the recalled main speaker embeddings, $\bm{s}_r$ denotes the recalled residual speaker embeddings and $\hat{\bm{s}}$ denotes the recalled speaker embeddings.
  • Figure 3: The variation of the TAVS metric for generated speech edited with different attributes under various values of editing degree $\alpha$.
  • Figure 4: The t-SNE visualization of the speaker embeddings extracted from generated speech edited with different attributes under various values of $\alpha$.
  • Figure 5: MOS-Cons and MOS-Corr scores with varying editing degrees $\alpha$. Edited speech tends to match both the source speech and text prompt in the highlighted area.