Universal Speech Content Factorization

Henry Li Xinyuan; Zexin Cai; Lin Zhang; Leibny Paola García-Perera; Berrak Sisman; Sanjeev Khudanpur; Nicholas Andrews; Matthew Wiesner

TL;DR

USCF achieves competitive intelligibility, naturalness, and speaker similarity compared to methods that require substantially more target-speaker data or additional neural training, and can serve as the acoustic representation for training timbre-prompted text-to-speech models.

Abstract

We propose Universal Speech Content Factorization (USCF), a simple and invertible linear method for extracting a low-rank speech representation in which speaker timbre is suppressed while phonetic content is preserved. USCF extends Speech Content Factorization, a closed-set voice conversion (VC) method, to an open-set setting by learning a universal speech-to-content mapping via least-squares optimization and deriving speaker-specific transformations from only a few seconds of target speech. We show through embedding analysis that USCF effectively removes speaker-dependent variation. As a zero-shot VC system, USCF achieves competitive intelligibility, naturalness, and speaker similarity compared to methods that require substantially more target-speaker data or additional neural training. Finally, we demonstrate that as a training-efficient timbre-disentangled speech feature, USCF features can serve as the acoustic representation for training timbre-prompted text-to-speech models. Speech samples and code are publicly available.
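The linear pipeline summarized in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the truncated-SVD factorization, the function names, the matrix shapes, and the toy data are all assumptions made for illustration. In particular, the synthetic speakers are constructed so that a single shared speech-to-content map exists exactly, which real WavLM features would satisfy only approximately.

```python
import numpy as np

def fit_closed_set_content(X_list, rank):
    """Assumed closed-set step: [X_1 ... X_N] ~= C [S_1 ... S_N] via truncated SVD."""
    X_cat = np.concatenate(X_list, axis=1)           # (T, N*D), frames are content-aligned
    U, s, _ = np.linalg.svd(X_cat, full_matrices=False)
    return U[:, :rank] * s[:rank]                    # low-rank content C, shape (T, rank)

def fit_universal_map(X_list, C):
    """Least-squares W minimizing sum_i ||X_i W - C||_F^2 (one map for all speakers)."""
    X_stack = np.concatenate(X_list, axis=0)         # (N*T, D)
    C_stack = np.concatenate([C] * len(X_list), axis=0)
    W, *_ = np.linalg.lstsq(X_stack, C_stack, rcond=None)
    return W                                         # (D, rank)

def fit_speaker_matrix(X_enroll, W):
    """Speaker transformation S for a new speaker: least squares of (X W) S ~= X."""
    S, *_ = np.linalg.lstsq(X_enroll @ W, X_enroll, rcond=None)
    return S                                         # (rank, D)

def convert(X_src, W, S_tgt):
    """Map source features to content, then render them with the target speaker matrix."""
    return X_src @ W @ S_tgt

def toy_speaker(rng, R, D):
    # Hypothetical toy model: the first R feature dims carry content unchanged,
    # the remaining D - R dims are speaker-specific linear mixtures of content.
    return np.hstack([np.eye(R), rng.standard_normal((R, D - R))])

rng = np.random.default_rng(0)
T, D, R, N = 200, 32, 8, 6                           # frames, feature dim, rank, speakers
C_true = rng.standard_normal((T, R))
S_train = [toy_speaker(rng, R, D) for _ in range(N)]
X_train = [C_true @ S for S in S_train]              # same content spoken by N speakers

C = fit_closed_set_content(X_train, R)
W = fit_universal_map(X_train, C)

# Unseen target speaker with a short enrollment (40 frames of arbitrary content).
S_new = toy_speaker(rng, R, D)
X_enroll = rng.standard_normal((40, R)) @ S_new
S_new_hat = fit_speaker_matrix(X_enroll, W)

# Convert a fresh utterance from training speaker 0 into the unseen speaker's timbre.
C_src = rng.standard_normal((50, R))
X_converted = convert(C_src @ S_train[0], W, S_new_hat)
X_reference = C_src @ S_new
rel_err = np.linalg.norm(X_converted - X_reference) / np.linalg.norm(X_reference)
print(f"relative conversion error on toy data: {rel_err:.2e}")
```

Every step is a closed-form linear solve, which is why enrolling a new speaker needs only a short enrollment segment and no gradient-based training. On real features the least-squares fits would leave a nonzero residual, so the near-exact recovery seen here is a property of the toy construction only.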

Paper Structure (21 sections, 5 equations, 3 figures, 6 tables)

Introduction
Related Works: Speech Disentanglement
Universal Speech Content Factorization
Closed-Set SCF
Approaches for Universal Speech to Content Mapping
Speaker Transformation Matrix Derivation
Experimental setup
Test Data
USCF-VC Details
Baselines
Metrics
Results
Voice Conversion Quality
Speaker ID within Phoneme
Ablations
...and 6 more sections

Figures (3)

Figure 1: Full pipeline for voice conversion using USCF.
Figure 2: Decomposing speech into a content-factorized form through SCF. $\mathbf{X}_i$ are content-aligned WavLM features for different speakers. Content alignment for $\mathbf{X}_i$ is performed through kNN matching.
Figure 3: Left: formulation of $\mathbf{W}_1$, one of our proposed universal speech-to-content mappings. Right: derivation of the speaker transformation matrix $\mathbf{S}_4$ for unseen speaker $4$.
