IntMeanFlow: Few-step Speech Generation with Integral Velocity Distillation

Wei Wang, Rong Cao, Yi Guo, Zhengyang Chen, Kuan Chen, Yuanyuan Huo

TL;DR

This work tackles slow inference in flow-based TTS by replacing instantaneous velocity modeling with integral velocity distillation, enabling few-step generation. IntMeanFlow trains a student to predict the average velocity over an interval, $\bar{v}(z_t,t,r) = \frac{z_r - z_t}{r - t}$, using targets built from the teacher's instantaneous velocity; this avoids both Jacobian-vector products and self-bootstrap during training. It also introduces O3S, a ternary-search-based method that places a fixed number of sampling steps across $[0,1]$ to improve speech quality without additional inference cost. Evaluations reach 3-NFE on F5-TTS (text2mel) and 1-NFE on CosyVoice2 (token2mel), delivering up to ~10x speedups with minimal quality loss. The work also provides an initialization strategy for migrating existing flow-matching models to IntMeanFlow.
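As a rough illustration of the average-velocity target $\bar{v}(z_t,t,r) = \frac{z_r - z_t}{r - t}$, the sketch below integrates a teacher's instantaneous velocity from $t$ to $r$ with plain Euler steps and divides the displacement by the interval length. Everything named here is an assumption made for illustration: `teacher_velocity` is a toy stand-in (dz/dt = -z) rather than a trained model, and the Euler integrator and `num_substeps` are not taken from the paper.

```python
import math


def teacher_velocity(z, t):
    """Hypothetical stand-in for a trained teacher: dz/dt = -z (ignores t)."""
    return [-x for x in z]


def average_velocity_target(z_t, t, r, velocity_fn, num_substeps=64):
    """Approximate (z_r - z_t) / (r - t) by Euler-integrating velocity_fn from t to r."""
    if r == t:
        raise ValueError("r must differ from t")
    h = (r - t) / num_substeps
    z = list(z_t)
    for k in range(num_substeps):
        v = velocity_fn(z, t + k * h)            # instantaneous velocity at the current state
        z = [x + h * vx for x, vx in zip(z, v)]  # one Euler step toward r
    return [(z_r - z_0) / (r - t) for z_r, z_0 in zip(z, z_t)]


z_t = [1.0, -2.0]
target = average_velocity_target(z_t, 0.0, 1.0, teacher_velocity, num_substeps=1000)

# For this toy teacher the exact answer is known in closed form: z_r = z_t * exp(-(r - t)).
exact = [x * (math.exp(-1.0) - 1.0) for x in z_t]
```

In a distillation setup of this kind, a student network taking $(z_t, t, r)$ would be regressed onto `target`, so a single student evaluation could cover the whole interval. Because the target is computed only from forward teacher evaluations, no Jacobian-vector product is involved in this sketch.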

Abstract

Flow-based generative models have greatly improved text-to-speech (TTS) synthesis quality, but inference speed remains limited by the iterative sampling process and multiple function evaluations (NFE). The recent MeanFlow model accelerates generation by modeling average velocity instead of instantaneous velocity. However, its direct application to TTS encounters challenges, including GPU memory overhead from Jacobian-vector products (JVP) and training instability due to self-bootstrap processes. To address these issues, we introduce IntMeanFlow, a framework for few-step speech generation with integral velocity distillation. By approximating average velocity with the teacher's instantaneous velocity over a temporal interval, IntMeanFlow eliminates the need for JVPs and self-bootstrap, improving stability and reducing GPU memory usage. We also propose the Optimal Step Sampling Search (O3S) algorithm, which identifies the model-specific optimal sampling steps, improving speech synthesis without additional inference overhead. Experiments show that IntMeanFlow achieves 1-NFE inference for token-to-spectrogram and 3-NFE for text-to-spectrogram tasks while maintaining high-quality synthesis. Demo samples are available at https://vvwangvv.github.io/intmeanflow.
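The step-placement idea behind O3S can be sketched as a coordinate-wise ternary search: hold all sampling steps fixed except one, and search that one step between its neighbors for the position that maximizes a quality score. This is only a minimal sketch under stated assumptions, not the paper's algorithm: `toy_score` is a made-up unimodal function standing in for an evaluation metric such as speaker similarity, and the function names, sweep count, and iteration count are hypothetical.

```python
def ternary_search_max(f, lo, hi, iters=40):
    """Maximize a unimodal function f on [lo, hi] by ternary search."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            lo = m1  # the maximum cannot lie in [lo, m1)
        else:
            hi = m2  # the maximum cannot lie in (m2, hi]
    return (lo + hi) / 2


def search_steps(score, steps, sweeps=3):
    """Optimize interior sampling steps one at a time, keeping them ordered in [0, 1]."""
    steps = list(steps)
    for _ in range(sweeps):
        for i in range(len(steps)):
            lo = steps[i - 1] if i > 0 else 0.0
            hi = steps[i + 1] if i + 1 < len(steps) else 1.0

            def f(x, i=i):
                # Score the schedule with only step i moved to x.
                return score(steps[:i] + [x] + steps[i + 1:])

            steps[i] = ternary_search_max(f, lo, hi)
    return steps


def toy_score(steps):
    """Hypothetical quality metric, peaked at the schedule [0.1, 0.4]."""
    best = [0.1, 0.4]
    return -sum((s - b) ** 2 for s, b in zip(steps, best))


found = search_steps(toy_score, [1 / 3, 2 / 3])
```

Starting from a uniform schedule, the search moves each step toward the location where the toy score peaks. With a real model, each call to `score` would require synthesizing and evaluating speech, so the cost is paid once offline and the number of inference steps is unchanged.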

Paper Structure

This paper contains 16 sections, 10 equations, 3 figures, 3 tables, 1 algorithm.

Figures (3)

  • Figure 1: Illustration of IntMeanFlow: The student model learns the averaged velocity from the instantaneous velocities provided by the teacher model at multiple states.
  • Figure 2: When fixing all but one of the sampling steps, the speaker similarity metric exhibits near-convex behavior.
  • Figure 3: NFE vs. WER (%) and SIM-o for the Flow Matching teacher and IntMeanFlow student models.