SF-Speech: Straightened Flow for Zero-Shot Voice Clone

Xuyuan Li, Zengqiang Shang, Hua Hua, Peiyang Shi, Chen Yang, Li Wang, Pengyuan Zhang

TL;DR

SF-Speech tackles zero-shot voice cloning with an ODE-based model trained via flow matching, addressing the training difficulty and curved trajectories that arise when standard Gaussian noise is used as the initial distribution. It introduces a lightweight, two-stage module that produces a more deterministic initial distribution, coupled with a detail ODE, to straighten the reverse trajectories and reduce the number of solver steps required. On the large-scale Emilia dataset and the in-the-wild MagicData dataset, SF-Speech achieves state-of-the-art or competitive results with significantly faster inference (a lower number of function evaluations, NFE) and improved trajectory stability, supported by curvature analyses and ablations. The approach demonstrates practical gains for fast, high-quality zero-shot TTS with in-context learning, and offers insight into how trajectory curvature influences sampling efficiency.

Abstract

Recently, neural ordinary differential equation (ODE) models trained with flow matching have achieved impressive performance on the zero-shot voice cloning task. Nevertheless, postulating standard Gaussian noise as the initial distribution of the ODE gives rise to numerous intersections within the fitted targets of flow matching, which makes the model harder to train and increases the curvature of the learned generation trajectories. These curved trajectories limit the ability of ODE models to generate desirable samples in only a few steps. This paper proposes SF-Speech, a novel voice cloning model based on ODEs and in-context learning. Unlike previous work, SF-Speech adopts a lightweight multi-stage module to generate a more deterministic initial distribution for the ODE. Without introducing any additional loss function, we effectively straighten the curved reverse trajectories of the ODE model by training it jointly with the proposed module. Experimental results on datasets of various scales show that SF-Speech outperforms state-of-the-art zero-shot TTS methods and requires only a quarter of the solver steps, resulting in a generation speed approximately 3.7 times that of Voicebox and E2 TTS. Audio samples are available at the demo page: https://lixuyuan102.github.io/Demo/
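The link between trajectory curvature and the number of solver steps can be illustrated with a toy example. The sketch below is not SF-Speech or its training procedure: it compares a fixed-step Euler solver on two hand-written 2D velocity fields that move the same start point to the same end point, one along a straight line and one along a circular arc. All names (`euler_solve`, `straight_velocity`, `curved_velocity`, the chosen points and step counts) are hypothetical and picked only for illustration.

```python
import numpy as np

def euler_solve(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = np.asarray(x0, dtype=float).copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity(x, t)
    return x

# Toy start and end points (assumptions, not taken from the paper).
x_start = np.array([1.0, 0.0])
x_end = np.array([0.0, 1.0])

def straight_velocity(x, t):
    # Constant velocity: the trajectory is a straight line from x_start to x_end.
    return x_end - x_start

# Rotation by 90 degrees over t in [0, 1]: the trajectory is a quarter circle
# that also starts at x_start and ends at x_end.
THETA = np.pi / 2
ROT = np.array([[0.0, -THETA], [THETA, 0.0]])

def curved_velocity(x, t):
    return ROT @ x

step_counts = [1, 4, 32]
err_straight = {}
err_curved = {}
for n in step_counts:
    err_straight[n] = np.linalg.norm(euler_solve(straight_velocity, x_start, n) - x_end)
    err_curved[n] = np.linalg.norm(euler_solve(curved_velocity, x_start, n) - x_end)
    print(f"steps={n:2d}  straight error={err_straight[n]:.2e}  curved error={err_curved[n]:.2e}")
```

In this toy setting the straight trajectory is reproduced by a single Euler step, while the curved one only approaches the end point as the step count grows. This is the general motivation for straightening trajectories; the actual savings reported for SF-Speech depend on its learned model and solver.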

Paper Structure

This paper contains 21 sections, 8 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Non-causal direction of forward trajectories ((a), (c)) and causal direction of reverse trajectories ((b), (d)) predicted by the neural network estimator with 128 reverse steps. The upper ((a), (b)) and lower ((c), (d)) parts show the effect of different initial distributions on the learned reverse trajectories.
  • Figure 2: Overall architecture of SF-Speech (left) and working diagram of ODE model (right) in SF-Speech.
  • Figure 3: The probability distribution of the PCCs-AV on the time (T) and channel (C) axes for different features in English and Mandarin. "Mel" and "Linear" denote the mel-spectrogram and the linear spectrogram, respectively. "VAE" means the latent embedding from VITS [kim2021conditional]. "Encodec" and "Hificodec" represent the latent embeddings from EnCodec [defossez2022high] and HiFi-codec [yang2023hifi]. "_VQ" stands for the indices of the corresponding codebook.
  • Figure 4: Two versions of the detail ODE structure. Version 1 consists of a U-Net-style linked Transformer and a 1D convolutional positional embedding. Version 2 adds 2D convolutional layers to Version 1.
  • Figure 5: Objective metrics of generated speech at different NFEs in ZS-TTS test. The first row shows the results on the MagicData test set, and the second row presents results on the LibriSpeech-PC test-clean set. The solid line means models trained on Emilia, while the dotted line represents models trained on MagicData.
  • ...and 3 more figures