DurFlex-EVC: Duration-Flexible Emotional Voice Conversion Leveraging Discrete Representations without Text Alignment

Hyung-Seok Oh, Sang-Hoon Lee, Deok-Hyeon Cho, Seong-Whan Lee

TL;DR

DurFlex-EVC addresses the challenge of emotion-driven voice conversion with variable durations without relying on text or phonetic alignments. It introduces a unit-aligning, style-disentangling framework built around a style autoencoder, a unit aligner, a hierarchical stylize encoder, and a diffusion-based Mel-spectrogram generator to enable parallel generation and robust duration control. The system leverages discrete SSL-based speech units (HuBERT) and a stochastic duration predictor to model emotion-driven duration dynamics, achieving high naturalness, preserved speaker identity, and strong emotional expressiveness across seen and unseen speakers. Empirical results on the ESD dataset show superiority over strong baselines in both subjective and objective metrics, with notable improvements in pronunciation accuracy, speaker similarity, and emotional alignment, indicating practical potential for real-time, expressive voice conversion without text-dependent preprocessing.

Abstract

Emotional voice conversion (EVC) involves modifying various acoustic characteristics, such as pitch and spectral envelope, to match a desired emotional state while preserving the speaker's identity. Existing EVC methods often rely on text transcriptions or time-alignment information and struggle to handle varying speech durations effectively. In this paper, we propose DurFlex-EVC, a duration-flexible EVC framework that operates without the need for text or alignment information. We introduce a unit aligner that models contextual information by aligning speech with discrete units representing content, eliminating the need for text or speech-text alignment. Additionally, we design a style autoencoder that effectively disentangles content and emotional style, allowing precise manipulation of the emotional characteristics of the speech. We further enhance emotional expressiveness through a hierarchical stylize encoder that applies the target emotional style at multiple hierarchical levels, refining the stylization process to improve the naturalness and expressiveness of the converted speech. Experimental results from subjective and objective evaluations demonstrate that our approach outperforms baseline models, effectively handling duration variability and enhancing emotional expressiveness in the converted speech.

Paper Structure

This paper contains 36 sections, 12 equations, 11 figures, 12 tables.

Figures (11)

  • Figure 1: Overall framework of the proposed method. The feature extractor transforms the source audio into input features, which the style autoencoder then disentangles and reconditions. The unit aligner provides unit-level context and performs duration modeling. The hierarchical stylize encoder encodes features at both the unit and frame levels, and the generator then produces the Mel-spectrogram. In this figure, "DP" represents the duration predictor, $\mathcal{LR}$ denotes the length regulator, and "Q", "K" and "V" represent the query, key and value of the cross-attention in the unit aligner, respectively. $\otimes$ denotes the concatenation operation. $w_{src}$ represents the source style vector and $w_{tgt}$ represents the target style vector. The style autoencoder removes the source style from the features and applies the target style, while the hierarchical stylize encoder and the generator take the target style as a condition.
  • Figure 2: Unit-level pooling and frame-level scaling. (a) The latent is average-pooled over each unit's duration, and (b) the latent is expanded by repeating each unit-level vector a number of times equal to its duration.
  • Figure 3: t-SNE visualization of emotion2vec features, grouped by speaker and by emotion.
  • Figure 4: SECS scores of the compared models for all emotion-conversion pairs.
  • Figure 5: EECS scores of the compared models for all emotion-conversion pairs.
  • ...and 6 more figures
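
The two operations named in the Figure 2 caption can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the function names (`run_lengths`, `unit_level_pool`, `frame_level_expand`) are hypothetical, and the durations here are taken from runs of repeated unit IDs as an assumption for illustration, whereas the caption only states that pooling and expansion follow unit durations.

```python
from itertools import groupby


def run_lengths(units):
    """Collapse consecutive repeated unit IDs into (unit, duration) pairs.

    Assumption for illustration: a unit's duration is the number of
    consecutive frames that share the same discrete unit ID.
    """
    return [(unit, len(list(group))) for unit, group in groupby(units)]


def unit_level_pool(frames, durations):
    """Average frame-level vectors over each unit's span (as in Figure 2a)."""
    if sum(durations) != len(frames):
        raise ValueError("durations must sum to the number of frames")
    pooled, start = [], 0
    for d in durations:
        segment = frames[start:start + d]
        dim = len(segment[0])
        # Mean of each feature dimension across the d frames of this unit.
        pooled.append([sum(f[i] for f in segment) / d for i in range(dim)])
        start += d
    return pooled


def frame_level_expand(unit_vectors, durations):
    """Repeat each unit-level vector `duration` times (as in Figure 2b)."""
    return [list(v) for v, d in zip(unit_vectors, durations) for _ in range(d)]


# Toy input: six frames of 2-dimensional latents with frame-wise unit IDs.
units = [5, 5, 9, 9, 9, 2]
frames = [[1.0, 0.0], [3.0, 2.0], [0.0, 6.0], [3.0, 6.0], [6.0, 6.0], [4.0, 4.0]]

pairs = run_lengths(units)            # [(5, 2), (9, 3), (2, 1)]
durations = [d for _, d in pairs]     # [2, 3, 1]
pooled = unit_level_pool(frames, durations)
expanded = frame_level_expand(pooled, durations)
```

Pooling shortens the sequence to one vector per unit, and expansion restores a frame-rate sequence. If the durations passed to `frame_level_expand` differ from those used for pooling (for example, durations predicted for a target emotion), the output length changes accordingly, which is the sense in which duration becomes flexible.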