Universal Speech Content Factorization

Henry Li Xinyuan, Zexin Cai, Lin Zhang, Leibny Paola García-Perera, Berrak Sisman, Sanjeev Khudanpur, Nicholas Andrews, Matthew Wiesner

TL;DR

USCF achieves competitive intelligibility, naturalness, and speaker similarity compared to methods that require substantially more target-speaker data or additional neural training, and can serve as the acoustic representation for training timbre-prompted text-to-speech models.

Abstract

We propose Universal Speech Content Factorization (USCF), a simple and invertible linear method for extracting a low-rank speech representation in which speaker timbre is suppressed while phonetic content is preserved. USCF extends Speech Content Factorization, a closed-set voice conversion (VC) method, to an open-set setting by learning a universal speech-to-content mapping via least-squares optimization and deriving speaker-specific transformations from only a few seconds of target speech. We show through embedding analysis that USCF effectively removes speaker-dependent variation. As a zero-shot VC system, USCF achieves competitive intelligibility, naturalness, and speaker similarity compared to methods that require substantially more target-speaker data or additional neural training. Finally, we demonstrate that as a training-efficient timbre-disentangled speech feature, USCF features can serve as the acoustic representation for training timbre-prompted text-to-speech models. Speech samples and code are publicly available.
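
The pipeline the abstract describes (a low-rank content factorization, a universal speech-to-content mapping fit by least squares, and a speaker transform estimated from a short target sample) can be illustrated with a toy NumPy sketch on synthetic data. This is only one possible reading. The truncated-SVD factorization, the stacked least-squares objective, the function names, and the toy dimensions are all assumptions made for illustration; the paper's actual formulations (for example of $\mathbf{W}_1$ and the speaker matrices $\mathbf{S}_i$) may differ.

```python
import numpy as np


def fit_content_factorization(aligned_feats, rank):
    """Assumed step: truncated SVD of horizontally stacked, content-aligned features."""
    stacked = np.hstack(aligned_feats)                 # (T, N * D)
    U, s, _ = np.linalg.svd(stacked, full_matrices=False)
    return U[:, :rank] * s[:rank]                      # shared content, (T, rank)


def fit_universal_mapping(aligned_feats, content):
    """Least-squares W such that X_i @ W ~= content for every training speaker."""
    X = np.vstack(aligned_feats)                       # (N * T, D)
    C = np.vstack([content] * len(aligned_feats))      # (N * T, rank)
    W, *_ = np.linalg.lstsq(X, C, rcond=None)
    return W                                           # (D, rank)


def fit_speaker_transform(target_feats, W):
    """Least-squares S such that (Y @ W) @ S ~= Y for a short target sample Y."""
    content = target_feats @ W
    S, *_ = np.linalg.lstsq(content, target_feats, rcond=None)
    return S                                           # (rank, D)


def convert(source_feats, W, S_target):
    """Map source features to content, then re-render them with the target transform."""
    return (source_feats @ W) @ S_target


# Synthetic stand-in for content-aligned features (hypothetical toy sizes).
rng = np.random.default_rng(0)
T, D, rank, n_speakers = 200, 12, 4, 3
true_content = rng.standard_normal((T, rank))
train = [true_content @ rng.standard_normal((rank, D)) for _ in range(n_speakers)]

content = fit_content_factorization(train, rank)
W = fit_universal_mapping(train, content)

# "Unseen" speaker: a short sample with different content and its own transform.
target = rng.standard_normal((50, rank)) @ rng.standard_normal((rank, D))
S_target = fit_speaker_transform(target, W)

# Invertibility check: mapping to content and back should recover the sample.
roundtrip_error = np.linalg.norm(target @ W @ S_target - target) / np.linalg.norm(target)
converted = convert(train[0], W, S_target)
```

In this noise-free toy setup the round trip through the content space reconstructs the target sample almost exactly, which mirrors the invertibility the abstract claims. It says nothing about conversion quality on real speech features.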

Paper Structure

This paper contains 21 sections, 5 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Full pipeline for voice conversion using USCF.
  • Figure 2: Decomposing speech into a content-factorized form through SCF. $\mathbf{X}_i$ are content-aligned WavLM features for different speakers. Content alignment for $\mathbf{X}_i$ is performed through kNN matching.
  • Figure 3: Left: formulation for $\mathbf{W}_1$, one of our proposed universal speech-to-content mappings. Right: derivation of the speaker transformation matrix $\mathbf{S}_4$ for unseen speaker $4$.