Enhancing Expressive Voice Conversion with Discrete Pitch-Conditioned Flow Matching Model
Jialong Zuo, Shengpeng Ji, Minghui Fang, Ziyue Jiang, Xize Cheng, Qian Yang, Wenrui Liu, Guangyan Zhang, Zehai Tu, Yiwen Guo, Zhou Zhao
TL;DR
The paper tackles expressiveness in voice conversion by moving beyond speaker identity preservation to richer prosody and emotion transfer. It introduces PFlow-VC, a conditional flow matching framework that conditions Mel synthesis on discrete pitch tokens from a self-supervised pitch VQVAE, along with target speaker prompts and time-varying timbre representations. Key contributions include a pitch quantization scheme with SMN-logf0, a dual-timbre encoder that combines global embeddings and time-varying timbre tokens, and an OT-CFM-based decoder that supports in-context pitch modeling and efficient generation. Experimental results on unseen LibriTTS and ESD data demonstrate superior zero-shot timbre transfer and emotion style consistency, with ablations validating the importance of pitch tokens and dynamic timbre representations for expressive VC.
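The pitch tokenization idea can be illustrated with a toy sketch. Everything here is an assumption for illustration: the paper's pitch VQVAE learns its codebook over encoded pitch features, whereas this sketch applies a hand-written scalar codebook directly to utterance-mean-normalized log-F0, and the helper names `normalize_log_f0` and `quantize` are hypothetical. It only shows why normalizing before quantizing yields tokens that do not depend on a speaker's average pitch.

```python
import math

def normalize_log_f0(f0_hz):
    """Log-scale voiced F0 values and subtract the utterance mean.

    Assumption: unvoiced frames are marked with 0 Hz and returned as None.
    """
    log_f0 = [math.log(f) if f > 0 else None for f in f0_hz]
    voiced = [v for v in log_f0 if v is not None]
    mean = sum(voiced) / len(voiced) if voiced else 0.0
    return [None if v is None else v - mean for v in log_f0]

def quantize(values, codebook, unvoiced_token=-1):
    """Map each value to the index of its nearest codebook entry.

    The scalar codebook is a stand-in for a learned VQ-VAE codebook.
    """
    tokens = []
    for v in values:
        if v is None:
            tokens.append(unvoiced_token)
        else:
            nearest = min(range(len(codebook)), key=lambda i: abs(codebook[i] - v))
            tokens.append(nearest)
    return tokens

# Hypothetical three-entry codebook over normalized log-F0.
codebook = [-0.5, 0.0, 0.5]
low_voice = quantize(normalize_log_f0([100.0, 0.0, 200.0]), codebook)
high_voice = quantize(normalize_log_f0([200.0, 0.0, 400.0]), codebook)
```

In this sketch the two contours, one an octave above the other, map to the same token sequence, which is the sense in which the discretized pitch is speaker-irrelevant.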
Abstract
This paper introduces PFlow-VC, a conditional flow matching voice conversion model that leverages fine-grained discrete pitch tokens and target speaker prompt information for expressive voice conversion (VC). Previous VC work focuses primarily on speaker conversion, leaving the expressiveness of the converted speech (such as prosody and emotion) underexplored. In contrast to previous methods, we adopt a simple and efficient approach to enhance the style expressiveness of voice conversion models. Specifically, we pretrain a self-supervised pitch VQVAE model to discretize speaker-irrelevant pitch information, and we use a masked pitch-conditioned flow matching model for Mel-spectrogram synthesis. This gives the speaker conversion model in-context pitch modeling capabilities and effectively improves its voice style transfer capacity. Additionally, we improve timbre similarity by combining global timbre embeddings with time-varying timbre tokens. Experiments on the unseen LibriTTS test-clean set and the emotional speech dataset ESD show the superiority of the PFlow-VC model in both timbre conversion and style transfer. Audio samples are available on the demo page https://speechai-demo.github.io/PFlow-VC/.
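For readers unfamiliar with flow matching, the following minimal sketch shows the standard optimal-transport conditional flow matching (OT-CFM) training pair and a masked regression loss. It is not the authors' implementation: the value of `sigma_min`, the helper names `ot_cfm_pair` and `masked_mse`, and the choice to compute the loss only on masked frames (with unmasked frames acting as prompt context) are assumptions made for illustration.

```python
SIGMA_MIN = 1e-4  # assumed value; OT-CFM formulations use a small constant here

def ot_cfm_pair(x0, x1, t, sigma_min=SIGMA_MIN):
    """Return (x_t, u_t) for noise x0, data x1, and time t in [0, 1].

    x_t lies on the straight path from noise to data, and u_t is the
    constant velocity along that path that the network regresses onto.
    """
    x_t = [(1 - (1 - sigma_min) * t) * a + t * b for a, b in zip(x0, x1)]
    u_t = [b - (1 - sigma_min) * a for a, b in zip(x0, x1)]
    return x_t, u_t

def masked_mse(pred, target, mask):
    """Mean squared error over positions where mask is truthy.

    Assumption: mask marks the frames to be generated; the remaining
    frames are given to the model as context and excluded from the loss.
    """
    terms = [(p - q) ** 2 for p, q, m in zip(pred, target, mask) if m]
    return sum(terms) / len(terms) if terms else 0.0

# Toy example with two "Mel" values: noise sample x0, data sample x1.
x0, x1 = [0.0, 1.0], [2.0, 3.0]
x_t, u_t = ot_cfm_pair(x0, x1, t=0.5)
# A real model would predict the velocity from x_t, t, pitch tokens, and
# timbre conditions; a zero prediction is used here as a placeholder.
loss = masked_mse([0.0, 0.0], u_t, mask=[0, 1])
```

At inference time, a model trained this way generates a Mel-spectrogram by integrating the predicted velocity field from noise at t = 0 to data at t = 1 with an ODE solver, which is what allows generation in a small number of steps.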
