MeanVC: Lightweight and Streaming Zero-Shot Voice Conversion via Mean Flows

Guobin Ma, Jixun Yao, Ziqian Ning, Yuepeng Jiang, Lingxin Xiong, Lei Xie, Pengcheng Zhu

TL;DR

MeanVC tackles streaming zero-shot voice conversion by combining the strengths of autoregressive and non-autoregressive modeling: a Diffusion Transformer denoises mel-spectrograms chunk by chunk, and mean flows allow high-fidelity synthesis with a single function evaluation (1-NFE). Diffusion adversarial post-training is added to reduce over-smoothing. The approach follows a recognition-synthesis pipeline with a Diffusion Transformer decoder and a streaming ASR/vocoder stack, implemented in a lightweight 14M-parameter model. Evaluations on Emilia Mandarin and Seed-TTS show that MeanVC outperforms existing streaming VC baselines on both subjective and objective measures while running in real time on CPU.

Abstract

Zero-shot voice conversion (VC) aims to transfer timbre from a source speaker to any unseen target speaker while preserving linguistic content. Growing application scenarios demand models with streaming inference capabilities. This has created a pressing need for models that are simultaneously fast, lightweight, and high-fidelity. However, existing streaming methods typically rely on either autoregressive (AR) or non-autoregressive (NAR) frameworks, which either require large parameter sizes to achieve strong performance or struggle to generalize to unseen speakers. In this study, we propose MeanVC, a lightweight and streaming zero-shot VC approach. MeanVC introduces a diffusion transformer with a chunk-wise autoregressive denoising strategy, combining the strengths of both AR and NAR paradigms for efficient streaming processing. By introducing mean flows, MeanVC regresses the average velocity field during training, enabling zero-shot VC with superior speech quality and speaker similarity in a single sampling step by directly mapping from the start to the endpoint of the flow trajectory. Additionally, we incorporate diffusion adversarial post-training to mitigate over-smoothing and further enhance speech quality. Experimental results demonstrate that MeanVC significantly outperforms existing zero-shot streaming VC systems, achieving superior conversion quality with higher efficiency and significantly fewer parameters. Audio demos and code are publicly available at https://aslp-lab.github.io/MeanVC.
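The single-step claim in the abstract can be illustrated with a toy sketch. It assumes the straight-path convention z_t = (1 - t) * x + t * eps commonly used with mean flows, which may differ from the paper's exact notation. The function `make_oracle_mean_velocity` is a hypothetical stand-in for the trained network: for one fixed data/noise pair the velocity along the path is constant, so the average velocity over any interval is known in closed form. The real model instead has to learn the average of the marginal velocity field.

```python
def interpolate(x, eps, t):
    """Point on the straight path z_t = (1 - t) * x + t * eps (assumed convention)."""
    return [(1.0 - t) * xi + t * ei for xi, ei in zip(x, eps)]


def make_oracle_mean_velocity(x, eps):
    """Hypothetical stand-in for a trained network u(z_t, r, t).

    On a straight path between one data point and one noise point, the
    instantaneous velocity eps - x is constant, so its average over any
    interval [r, t] is the same vector.
    """
    def u(z_t, r, t):
        return [ei - xi for xi, ei in zip(x, eps)]
    return u


def one_step_sample(u, z_t, r=0.0, t=1.0):
    """One function evaluation: z_r = z_t - (t - r) * u(z_t, r, t)."""
    velocity = u(z_t, r, t)
    return [zi - (t - r) * vi for zi, vi in zip(z_t, velocity)]


target = [0.5, -1.0, 2.0]    # toy stand-in for a clean mel frame
noise = [1.5, 0.25, -0.75]   # toy stand-in for the noise start point
u = make_oracle_mean_velocity(target, noise)
z1 = interpolate(target, noise, 1.0)   # start of sampling: pure noise
recovered = one_step_sample(u, z1)     # jump from t = 1 to r = 0 in one call
```

Because the average velocity already integrates the whole trajectory, no intermediate time steps are needed; a model that regresses the instantaneous velocity would require several solver steps to cover the same interval.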

Paper Structure

This paper contains 12 sections, 5 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Overall architecture of our proposed MeanVC.
  • Figure 2: Chunk-wise causal mask in the DiT decoder. In this example, each noisy mel-spectrogram chunk $Z_i$ can attend to up to 3 preceding clean mel-spectrogram chunks $M_j$ (where $j \in [i-3, i-1]$) and itself. The green cells indicate allowed attention, while the white cells indicate restricted attention.
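The mask described in the Figure 2 caption can be sketched at chunk level as follows. This is an illustrative reconstruction, not the authors' code. The column layout (all clean chunks M_j first, then all noisy chunks Z_j), the 0-based indexing, and the names `chunkwise_causal_mask` and `max_history` are assumptions made for the example.

```python
def chunkwise_causal_mask(num_chunks, max_history=3):
    """Chunk-level attention mask with noisy chunks Z_i as queries.

    Columns 0 .. num_chunks-1 stand for clean chunks M_j and columns
    num_chunks .. 2*num_chunks-1 stand for noisy chunks Z_j (assumed layout).
    True means attention is allowed.
    """
    mask = [[False] * (2 * num_chunks) for _ in range(num_chunks)]
    for i in range(num_chunks):
        # Clean history: M_j for j in [i - max_history, i - 1], clipped at 0.
        for j in range(max(0, i - max_history), i):
            mask[i][j] = True
        # The noisy chunk always attends to itself.
        mask[i][num_chunks + i] = True
    return mask


def render(mask):
    """Text rendering: '#' for allowed attention, '.' for restricted."""
    return "\n".join("".join("#" if cell else "." for cell in row) for row in mask)


mask = chunkwise_causal_mask(num_chunks=5, max_history=3)
print(render(mask))
```

In a frame-level implementation, each cell would be expanded to a block of chunk-size by chunk-size entries. Capping the history at a fixed number of chunks keeps the per-chunk attention cost bounded regardless of how long the stream runs.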