ClapFM-EVC: High-Fidelity and Flexible Emotional Voice Conversion with Dual Control from Natural Language and Speech

Yu Pan; Yanni Hu; Yuguang Yang; Jixun Yao; Jianhao Ye; Hongbin Zhou; Lei Ma; Jianjun Zhao

TL;DR

ClapFM-EVC targets high-fidelity emotional voice conversion (EVC) with flexible, interpretable control by combining a soft-label-guided cross-modal emotion representation model (EVC-CLAP) with a flow-based decoder (AdaFM-VC). EVC-CLAP aligns emotional information from natural language prompts and emotion labels using a symmetric KL loss. AdaFM-VC fuses emotional embeddings with content representations in its FuEncoder, then decodes them with a conditional flow matching decoder conditioned on the target emotion. The framework supports three inference modes, driven by a natural language emotion prompt, by reference speech, or by retrieval-based references, which gives users several ways to control the output. Experiments on a Mandarin expressive corpus report state-of-the-art emotion similarity and naturalness, and ablations verify the contributions of soft-label learning, the adaptive intensity gate (AIG), and flow-based decoding. The results suggest practical potential for applications such as voice assistants and dubbing.
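To make the soft-label idea concrete, here is a minimal sketch of a symmetric KL alignment loss in plain Python. It is not the authors' implementation: the function names, the temperature value, and the assumption that the soft-target matrix is symmetric with rows summing to 1 are all illustrative choices, since this summary does not specify how EVC-CLAP builds its targets.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn similarity scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp((x - peak) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def symmetric_kl(p, q):
    """Average of KL(p || q) and KL(q || p)."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def soft_label_alignment_loss(audio_embs, text_embs, soft_targets, temperature=0.07):
    """Hypothetical batch loss: compare predicted audio/text similarity
    distributions with soft targets in both retrieval directions.

    Assumption: soft_targets is a symmetric N x N matrix whose rows sum
    to 1, so row i serves as the target for both directions.
    """
    n = len(audio_embs)
    loss = 0.0
    for i in range(n):
        audio_to_text = [dot(audio_embs[i], text_embs[j]) for j in range(n)]
        text_to_audio = [dot(text_embs[i], audio_embs[j]) for j in range(n)]
        loss += symmetric_kl(soft_targets[i], softmax(audio_to_text, temperature))
        loss += symmetric_kl(soft_targets[i], softmax(text_to_audio, temperature))
    return loss / (2 * n)
```

Unlike a one-hot contrastive target, a soft target row can give partial credit to other items in the batch that share the same or a related emotion, which is the intuition behind label-guided soft supervision.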

Abstract

Despite great advances, achieving high-fidelity emotional voice conversion (EVC) with flexible and interpretable control remains challenging. This paper introduces ClapFM-EVC, a novel EVC framework capable of generating high-quality converted speech driven by natural language prompts or reference speech, with adjustable emotion intensity. We first propose EVC-CLAP, an emotional contrastive language-audio pre-training model guided by natural language prompts and categorical labels, to extract and align fine-grained emotional elements across the speech and text modalities. Then, a FuEncoder with an adaptive intensity gate is presented to seamlessly fuse emotional features with Phonetic PosteriorGrams from a pre-trained ASR model. To further improve emotional expressiveness and speech naturalness, we propose a flow matching model conditioned on these captured features to reconstruct the Mel-spectrogram of the source speech. Subjective and objective evaluations validate the effectiveness of ClapFM-EVC.
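The two decoding-side ideas in the abstract can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the paper's architecture: `gated_fusion` is a hypothetical scalar gate that assumes the emotion embedding has already been projected to the content dimension, and `cfm_training_pair` uses the widely used optimal-transport conditional flow matching path, which this summary does not confirm as the exact formulation in AdaFM-VC.

```python
import random


def gated_fusion(content_frames, emotion_emb, intensity):
    """Hypothetical intensity gate: add a scaled emotion embedding to
    every content frame (e.g. a PPG frame projected to the same size)."""
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity must lie in [0, 1]")
    fused = []
    for frame in content_frames:
        if len(frame) != len(emotion_emb):
            raise ValueError("frame and emotion embedding sizes differ")
        fused.append([c + intensity * e for c, e in zip(frame, emotion_emb)])
    return fused


def cfm_training_pair(x1, t, sigma_min=1e-4, rng=None):
    """Build one training example for conditional flow matching.

    x1 is a flattened target Mel-spectrogram, t is a time in [0, 1].
    Returns the noisy point x_t on the straight path from Gaussian
    noise x0 to x1, and the velocity u_t the network should predict.
    """
    rng = rng or random.Random()
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]
    noise_scale = 1.0 - (1.0 - sigma_min) * t
    x_t = [noise_scale * n + t * d for n, d in zip(x0, x1)]
    u_t = [d - (1.0 - sigma_min) * n for n, d in zip(x0, x1)]
    return x_t, u_t


def cfm_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    diffs = [(p - u) ** 2 for p, u in zip(predicted_velocity, target_velocity)]
    return sum(diffs) / len(diffs)
```

In a full system, a neural network would take `x_t`, `t`, and the fused conditioning features as input and be trained to minimise `cfm_loss` against `u_t`. At inference, changing `intensity` would scale how strongly the emotion embedding shifts the conditioning, which is one simple way to realise adjustable emotion intensity.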
