Controlling your Attributes in Voice

Xuyuan Li, Zengqiang Shang, Li Wang, Pengyuan Zhang

TL;DR

This work addresses non-parallel attribute control in speech by proposing a two-stage framework that jointly learns attribute and identity representations. A GAN-based Speaker Representation Variational Autoencoder (SRVAE) decomposes speaker vectors into $z_{age}$, $z_{gender}$, and $z_{identity}$, enabling predefined attribute manipulation via cyclic consistency training. The subsequent Two-stage Voice Conversion (TSVC) uses an average generator and an ODE-based detail mapper to produce speaker-specific speech from attribute-conditioned features, preserving identity and improving quality. Overall, the method demonstrates effective age and gender control at the speech level with maintained speaker identity, suggesting practical applications in media production and privacy-preserving synthesis, with potential extensions to text-to-speech scenarios.
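The decomposition described above can be pictured with a toy sketch. It is not the authors' SRVAE: the real model learns its encoder and decoder with GAN and cycle-consistency objectives, whereas this example only shows the partition-and-swap idea on a flat latent vector. The names `SpeakerLatents`, `split_latents`, `set_attribute`, and `merge_latents`, as well as the latent dimensions, are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SpeakerLatents:
    """Hypothetical container for the three latent parts named in the TL;DR."""
    z_age: List[float]
    z_gender: List[float]
    z_identity: List[float]


def split_latents(z: List[float], age_dim: int, gender_dim: int) -> SpeakerLatents:
    """Partition a flat latent vector into age, gender, and identity slices.

    The slice sizes are assumptions; the source does not state them.
    """
    if age_dim <= 0 or gender_dim <= 0 or age_dim + gender_dim >= len(z):
        raise ValueError("age_dim and gender_dim must leave room for identity")
    return SpeakerLatents(
        z_age=list(z[:age_dim]),
        z_gender=list(z[age_dim:age_dim + gender_dim]),
        z_identity=list(z[age_dim + gender_dim:]),
    )


def set_attribute(
    latents: SpeakerLatents,
    z_age: Optional[List[float]] = None,
    z_gender: Optional[List[float]] = None,
) -> SpeakerLatents:
    """Return a copy with predefined attribute latents swapped in.

    z_identity is always copied unchanged, which is the property the
    method aims to preserve.
    """
    if z_age is not None and len(z_age) != len(latents.z_age):
        raise ValueError("z_age has the wrong dimension")
    if z_gender is not None and len(z_gender) != len(latents.z_gender):
        raise ValueError("z_gender has the wrong dimension")
    return SpeakerLatents(
        z_age=list(z_age) if z_age is not None else list(latents.z_age),
        z_gender=list(z_gender) if z_gender is not None else list(latents.z_gender),
        z_identity=list(latents.z_identity),
    )


def merge_latents(latents: SpeakerLatents) -> List[float]:
    """Concatenate the three parts back into one flat vector."""
    return latents.z_age + latents.z_gender + latents.z_identity


# Toy usage: replace only the age part of a 6-dimensional latent.
source = split_latents([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], age_dim=2, gender_dim=1)
edited = set_attribute(source, z_age=[0.9, 0.8])
flat = merge_latents(edited)
```

In this sketch the edited vector differs from the source only in its first two entries. In the actual method, a learned decoder would map the edited latents back to a speaker vector before voice conversion.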

Abstract

Attribute control in generative tasks aims to modify personal attributes, such as age and gender, while preserving the identity information in the source sample. Although significant progress has been made in controlling facial attributes in image generation, similar approaches for speech generation remain largely unexplored. This letter proposes a novel method for controlling speaker attributes in speech without parallel data. Our approach consists of two main components: a GAN-based speaker representation variational autoencoder that extracts speaker identity and attributes from a speaker vector, and a two-stage voice conversion model that captures the natural expression of speaker attributes in speech. Experimental results show that the proposed method not only achieves attribute control at the speaker representation level but also enables manipulation of speaker age and gender at the speech level while preserving speech quality and speaker identity.

Paper Structure (12 sections, 3 figures, 2 tables, 1 algorithm)

Figures (3)

  • Figure 1: Overall workflow of the proposed method (a), detailed structure of SRVAE (b), and detailed structure of TSVC (c). $L^p$ denotes the predefined attribute label, and $L^o$ denotes the original attribute label.
  • Figure 2: Confusion matrix for subjective prediction of speaker age.
  • Figure 3: AB test results of speaker identity.