SF-Speech: Straightened Flow for Zero-Shot Voice Clone

Xuyuan Li; Zengqiang Shang; Hua Hua; Peiyang Shi; Chen Yang; Li Wang; Pengyuan Zhang

TL;DR

SF-Speech tackles zero-shot voice cloning with an ODE-based model trained via flow matching. It targets a weakness of the standard Gaussian initial distribution, which makes the flow matching targets intersect, complicates training, and bends the learned trajectories. The model introduces a lightweight, two-stage module that produces a more deterministic initial distribution, coupled with a detail ODE; together they straighten the reverse trajectories and reduce the number of solver steps required. On the large-scale Emilia dataset and the in-the-wild MagicData dataset, SF-Speech achieves state-of-the-art or competitive results with significantly faster inference (a lower number of function evaluations, NFE) and improved trajectory stability, supported by curvature analyses and ablations. The approach demonstrates practical gains for fast, high-quality zero-shot TTS with in-context learning, and offers insight into how trajectory curvature influences sampling efficiency.

Abstract

Recently, neural ordinary differential equation (ODE) models trained with flow matching have achieved impressive performance on the zero-shot voice clone task. Nevertheless, using standard Gaussian noise as the initial distribution of the ODE gives rise to numerous intersections among the fitted targets of flow matching, which makes model training harder and increases the curvature of the learned generation trajectories. These curved trajectories limit the ability of ODE models to generate desirable samples in a few steps. This paper proposes SF-Speech, a novel voice clone model based on ODE and in-context learning. Unlike previous work, SF-Speech adopts a lightweight multi-stage module to generate a more deterministic initial distribution for the ODE. Without introducing any additional loss function, we effectively straighten the curved reverse trajectories of the ODE model by jointly training it with the proposed module. Experimental results on datasets of various scales show that SF-Speech outperforms state-of-the-art zero-shot TTS methods and requires only a quarter of the solver steps, resulting in a generation speed approximately 3.7 times that of Voicebox and E2 TTS. Audio samples are available at the demo page: https://lixuyuan102.github.io/Demo/.
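
The link between trajectory curvature and solver steps can be illustrated with a toy example. The sketch below is not the SF-Speech model: it assumes a fixed-step Euler solver and two hand-written 2-D velocity fields (`straight_velocity` and `curved_velocity`, both hypothetical names) that share the same start and end points. It only shows that a constant-velocity, straight trajectory is integrated exactly in one step, whereas a curved one needs more function evaluations to reach the same endpoint.

```python
import math

# Toy endpoints standing in for an initial sample and a target sample.
# These values are illustrative assumptions, not taken from the paper.
X0 = (0.0, 0.0)
X1 = (1.0, 2.0)


def euler_solve(velocity, x0, n_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        v = velocity(x, i * dt)  # one function evaluation per step
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def straight_velocity(x, t):
    """Constant velocity along the straight path x(t) = (1 - t) * X0 + t * X1."""
    return (X1[0] - X0[0], X1[1] - X0[1])


def curved_velocity(x, t):
    """Velocity of the parabolic path x(t) = (t, 2 * t**2), also ending at X1."""
    return (1.0, 4.0 * t)


def endpoint_error(velocity, n_steps):
    """Euclidean distance between the Euler endpoint and the true endpoint X1."""
    return math.dist(euler_solve(velocity, X0, n_steps), X1)


if __name__ == "__main__":
    for n in (1, 4, 32):
        print(n, endpoint_error(straight_velocity, n), endpoint_error(curved_velocity, n))
```

In this toy setting the error of the curved path shrinks only as the step count grows, which is the cost that straighter trajectories are meant to avoid.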
