MeanVoiceFlow: One-step Nonparallel Voice Conversion with Mean Flows

Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Yuto Kondo

TL;DR

The proposed MeanVoiceFlow is a novel one-step nonparallel VC model based on mean flows, which can be trained from scratch without requiring pretraining or distillation, and introduces a structural margin reconstruction loss as a zero-input constraint, which moderately regularizes the input-output behavior of the model without harmful statistical averaging.

Abstract

In voice conversion (VC) applications, diffusion and flow-matching models have exhibited exceptional speech quality and speaker similarity performances. However, they are limited by slow conversion owing to their iterative inference. Consequently, we propose MeanVoiceFlow, a novel one-step nonparallel VC model based on mean flows, which can be trained from scratch without requiring pretraining or distillation. Unlike conventional flow matching that uses instantaneous velocity, mean flows employ average velocity to more accurately compute the time integral along the inference path in a single step. However, training the average velocity requires its derivative to compute the target velocity, which can cause instability. Therefore, we introduce a structural margin reconstruction loss as a zero-input constraint, which moderately regularizes the input-output behavior of the model without harmful statistical averaging. Furthermore, we propose conditional diffused-input training in which a mixture of noise and source data is used as input to the model during both training and inference. This enables the model to effectively leverage source information while maintaining consistency between training and inference. Experimental results validate the effectiveness of these techniques and demonstrate that MeanVoiceFlow achieves performance comparable to that of previous multi-step and distillation-based models, even when trained from scratch. Audio samples are available at https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/meanvoiceflow/.
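The difference between instantaneous and average velocity can be illustrated on a toy problem. The sketch below is not the MeanVoiceFlow model: it assumes a hypothetical one-dimensional flow with $v(z, t) = z$, whose path is known in closed form, and it assumes the common mean-flow convention that a one-step sample is $z_r = z_t - (t - r)\,u(z_t, r, t)$. The function names are illustrative only. Under these assumptions, a single Euler step with the instantaneous velocity misses the target, whereas a single step with the average velocity (here computed by numerical integration rather than learned) recovers it.

```python
import math


def instantaneous_velocity(z, t):
    # Toy velocity field (an assumption for illustration): dz/dt = z.
    return z


def path(z_t, t, tau):
    # Exact solution of dz/dt = z passing through (t, z_t).
    return z_t * math.exp(tau - t)


def average_velocity(z_t, r, t, n=10_000):
    # u(z_t, r, t) = 1/(t - r) * integral_r^t v(z_tau, tau) d tau,
    # approximated with the midpoint rule along the exact path.
    h = (t - r) / n
    total = 0.0
    for i in range(n):
        tau = r + (i + 0.5) * h
        total += instantaneous_velocity(path(z_t, t, tau), tau)
    return total * h / (t - r)


z1 = 2.0                           # starting point at t = 1
exact_z0 = path(z1, 1.0, 0.0)      # ground-truth endpoint at t = 0

# One Euler step using the tangent at t = 1 only.
euler_z0 = z1 - (1.0 - 0.0) * instantaneous_velocity(z1, 1.0)

# One step using the average velocity over [0, 1].
mean_z0 = z1 - (1.0 - 0.0) * average_velocity(z1, 0.0, 1.0)

print(f"exact={exact_z0:.6f} euler={euler_z0:.6f} mean={mean_z0:.6f}")
```

In the actual method the average velocity is predicted by a trained network instead of being integrated numerically, which is why its training target and stability matter.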

Paper Structure

This paper contains 14 sections, 10 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Comparison of (a) instantaneous velocity used in conventional flow matching and (b) average velocity used in mean flows. (a) Instantaneous velocity $v(z_t, t)$ (blue arrow) represents the tangent direction of the path at a single time step $t$. (b) Average velocity $u(z_t, r, t)$ (orange arrow) aligns with the displacement between two time steps $r$ and $t$. In MeanVoiceFlow, a zero-input constraint is imposed on $u(\bar{\epsilon}, 0, 1)$ (green arrow), the average velocity for a zero-input sample $\bar{\epsilon} = 0$, using a structural margin reconstruction loss to moderately guide learning.
  • Figure 2: Comparison of input types during inference and training. Previous studies (e.g., TKanekoIS2024) use input type (c) during training and (b) during inference, causing a training-inference mismatch. In contrast, the proposed method uses (d) during training and (b) during inference, effectively eliminating this mismatch.
  • Figure 3: Analysis of conditional diffused-input training. Conditional diffused-input training (pink line) enhances both robustness to the mixing ratio $t'$ and peak performance.
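
The mixing ratio $t'$ in Figure 3 controls how much noise is blended into the source data before it is fed to the model. The sketch below shows only that blending step, assuming a linear interpolation $(1 - t')\,x_{\text{src}} + t'\,\epsilon$ with standard Gaussian noise; the exact form used in the paper, the feature representation, and the name `diffuse_input` are assumptions made for illustration.

```python
import random


def diffuse_input(source, t_prime, rng):
    """Blend source features with Gaussian noise.

    Assumed form: (1 - t') * source + t' * noise, with noise ~ N(0, 1).
    t' = 0 returns the source unchanged; t' = 1 returns pure noise.
    """
    if not 0.0 <= t_prime <= 1.0:
        raise ValueError("t_prime must lie in [0, 1]")
    return [(1.0 - t_prime) * s + t_prime * rng.gauss(0.0, 1.0) for s in source]


rng = random.Random(0)
source = [0.5, -1.0, 2.0]                  # hypothetical source features

clean = diffuse_input(source, 0.0, rng)    # no noise: equals the source
mixed = diffuse_input(source, 0.7, rng)    # mostly noise, some source

print(clean)
print(mixed)
```

Using the same blending step at both training and inference time is what keeps the two input distributions consistent in this sketch.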