Investigating Disentanglement in a Phoneme-level Speech Codec for Prosody Modeling

Sotirios Karapiperis, Nikolaos Ellinas, Alexandra Vioni, Junkwang Oh, Gunu Jho, Inchul Hwang, Spyros Raptis

TL;DR

The paper investigates disentanglement in a phoneme-level RVQ-VAE neural codec for prosody modeling, aiming to separate linguistic content and speaker identity from prosody. It proposes a two-level RVQ with phoneme-level conditioning and a speaker embedding, implemented with a Conformer-based encoder/decoder and a Gaussian Upsampler, trained via reconstruction losses and EMA for codebooks. Extensive experiments demonstrate high code usage, robust disentanglement from linguistic and speaker information, and principal components that correspond to pitch ($F_0$) and energy (RMS); task-based evaluations confirm intelligibility, natural cross-resynthesis, and transferable prosody. The latent space is compact, interpretable, and competitive with continuous models, suggesting practical benefits for controllable prosody and potential for scalable priors and phoneme-driven latent prediction in TTS and voice conversion.

Abstract

Most of the prevalent approaches in speech prosody modeling rely on learning global style representations in a continuous latent space, which encode and transfer the attributes of reference speech. However, recent work on neural codecs based on Residual Vector Quantization (RVQ) already shows great potential, offering distinct advantages. We investigate the prosody modeling capabilities of the discrete space of such an RVQ-VAE model, modifying it to operate at the phoneme level. We condition both the encoder and decoder of the model on linguistic representations and apply a global speaker embedding in order to factor out both phonetic and speaker information. We conduct an extensive set of investigations based on subjective experiments and objective measures to show that the phoneme-level discrete latent representations obtained this way achieve a high degree of disentanglement, capturing fine-grained prosodic information that is robust and transferable. The latent space turns out to have an interpretable structure, with its principal components corresponding to pitch and energy.
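
As a rough illustration of the residual vector quantization the abstract refers to, the sketch below quantizes a single latent vector with two codebooks, where the second level codes the residual left by the first. It is a minimal sketch under stated assumptions, not the paper's model: the codebooks are random rather than learned, the sizes `DIM` and `CODEBOOK_SIZE` are arbitrary, the helper names `nearest` and `rvq_encode` are hypothetical, and the Conformer encoder/decoder, linguistic and speaker conditioning, and EMA codebook updates are all omitted.

```python
import random

def nearest(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared Euclidean)."""
    def sqdist(code):
        return sum((a - b) ** 2 for a, b in zip(vec, code))
    return min(range(len(codebook)), key=lambda i: sqdist(codebook[i]))

def rvq_encode(vec, codebooks):
    """Quantize vec with a stack of codebooks; each level codes the previous residual."""
    residual = list(vec)
    quantized = [0.0] * len(vec)
    indices = []
    for codebook in codebooks:
        i = nearest(residual, codebook)
        code = codebook[i]
        # Accumulate the chosen code and subtract it from what is left to explain.
        quantized = [q + c for q, c in zip(quantized, code)]
        residual = [r - c for r, c in zip(residual, code)]
        indices.append(i)
    return indices, quantized

random.seed(0)
DIM, CODEBOOK_SIZE = 4, 8  # illustrative sizes, not taken from the paper
codebooks = [
    [[random.gauss(0, scale) for _ in range(DIM)] for _ in range(CODEBOOK_SIZE)]
    for scale in (1.0, 0.5)  # two RVQ levels with random (untrained) codes
]
latent = [random.gauss(0, 1) for _ in range(DIM)]  # stand-in for one phoneme's encoder output
indices, quantized = rvq_encode(latent, codebooks)
print(indices)  # one code index per RVQ level
```

In a phoneme-level setting, each phoneme would be described by one such short tuple of code indices, which is what makes the representation compact and discrete.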

Paper Structure (16 sections, 4 figures, 8 tables)

Figures (4)

  • Figure 1: The adapted architecture.
  • Figure 2: Visualization of the distances between the histograms of each phoneme.
  • Figure 3: Selected paths in the latent space.
  • Figure 4: Pitch contours of a ground-truth, resynthesized, and cross-resynthesized speech utterance.