DiFlow-TTS: Compact and Low-Latency Zero-Shot Text-to-Speech with Factorized Discrete Flow Matching

Ngoc-Son Nguyen, Thanh V. T. Tran, Hieu-Nghia Huynh-Nguyen, Truong-Son Hy, Van Nguyen

TL;DR

DiFlow-TTS is a zero-shot TTS system that learns probability flows directly in the discrete space of factorized codec tokens. It combines a Phoneme-Content Mapper with a Factorized Discrete Flow Denoiser that models prosody and acoustic attributes as separate token streams. Trained on a 470-hour LibriTTS subset, it achieves strong naturalness and robust prosody with a compact model and low-latency inference. WER and MOS are competitive with autoregressive and diffusion baselines, while the model is up to 11.7x smaller and inference is up to 34x faster, which makes it suitable for latency-sensitive deployments. Speaker similarity remains a limitation under the current conditioning scheme, motivating future work on improved timbre modeling and cross-attention-based speaker conditioning to enhance voice cloning fidelity.

Abstract

This paper introduces DiFlow-TTS, a novel zero-shot text-to-speech (TTS) system that employs discrete flow matching for generative speech modeling. We position this work as an entry point that may facilitate further advances in this research direction. Through extensive empirical evaluation, we analyze both the strengths and limitations of this approach across key aspects, including naturalness, expressive attributes, speaker identity, and inference latency. To this end, we leverage factorized speech representations and design a deterministic Phoneme-Content Mapper for modeling linguistic content, together with a Factorized Discrete Flow Denoiser that jointly models multiple discrete token streams corresponding to prosody and acoustics to capture expressive speech attributes. Experimental results demonstrate that DiFlow-TTS achieves strong performance across multiple metrics while maintaining a compact model size, up to 11.7 times smaller, and enabling low-latency inference that is up to 34 times faster than recent state-of-the-art baselines. Audio samples are available on our demo page: https://diflow-tts.github.io.
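As a rough illustration of what discrete flow matching over several token streams involves, the sketch below builds the noisy training input for a mixture path: at time t each position keeps its clean token with probability t and is otherwise replaced by a source token. The mask-token source, the shared time t across streams, the codebook size, and the names `corrupt_tokens` and `corrupt_factorized` are all assumptions made for illustration; the abstract does not specify the probability path that DiFlow-TTS uses.

```python
import numpy as np

def corrupt_tokens(x1, t, mask_id, rng):
    """Sample x_t from a mixture path (assumed, not taken from the paper):
    each position keeps its clean token with probability t and is
    replaced by mask_id otherwise."""
    keep = rng.random(x1.shape) < t
    return np.where(keep, x1, mask_id)

def corrupt_factorized(streams, t, mask_ids, rng):
    """Corrupt every token stream at the same time t, drawing the
    keep/replace decisions independently per stream and per position."""
    return {
        name: corrupt_tokens(tokens, t, mask_ids[name], rng)
        for name, tokens in streams.items()
    }

rng = np.random.default_rng(0)
# Hypothetical sizes: 8 frames, 1024 codebook entries, mask id 1024.
streams = {
    "prosody": rng.integers(0, 1024, size=8),
    "acoustic": rng.integers(0, 1024, size=8),
}
mask_ids = {"prosody": 1024, "acoustic": 1024}
noisy = corrupt_factorized(streams, 0.5, mask_ids, rng)
```

At t = 0 every position is the source token and at t = 1 every position is clean, so a denoiser trained to predict the clean tokens from `noisy` can be applied over a small number of steps to move from pure source tokens toward speech tokens.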

Paper Structure

This paper contains 43 sections, 7 equations, 6 figures, 7 tables, 2 algorithms.

Figures (6)

  • Figure 1: Overview of DiFlow-TTS. A Codec Encoder decomposes the speech prompt into speaker, prosody, and acoustic tokens, while the Phoneme-Content Mapper converts text into content embeddings. Conditioned on these, the Factorized Discrete Flow Denoiser generates prosody and acoustic tokens, and the Codec Decoder reconstructs the waveform.
  • Figure 2: The detailed architecture of DiFlow-TTS comprises three main components: (a) Speech Tokenizer, which extracts factorized discrete tokens and a speaker embedding from raw speech; (b) Phoneme-Content Mapper, which maps input phonemes to discrete content tokens and generates the corresponding content embeddings; and (c) Factorized Discrete Flow Denoiser, which performs discrete flow matching conditioned on the content embeddings, speaker embedding, and the discrete prosody and acoustic tokens derived from the reference speech prompt.
  • Figure 3: UTMOS vs. NFE for different prompt durations.
  • Figure 4: Performance across different SNR levels in terms of UTMOS (left) and WER (right).
  • Figure 5: The detailed architecture of the DiT block.
  • ...and 1 more figure

Theorems & Definitions (1)

  • Definition B.1