DiFlow-TTS: Compact and Low-Latency Zero-Shot Text-to-Speech with Factorized Discrete Flow Matching

Ngoc-Son Nguyen; Thanh V. T. Tran; Hieu-Nghia Huynh-Nguyen; Truong-Son Hy; Van Nguyen

TL;DR

DiFlow-TTS performs zero-shot TTS by learning probability flows directly in the discrete space of factorized codec tokens. It combines a Phoneme-Content Mapper with a Factorized Discrete Flow Denoiser that models prosody and acoustic attributes as separate token streams. On a 470-hour LibriTTS subset, it reports strong naturalness, robust prosody, and compact, low-latency inference. WER and MOS are competitive with autoregressive and diffusion baselines, while the model is up to 11.7x smaller and inference is up to 34x faster, which suits latency-sensitive deployments. Speaker similarity remains a limitation under the current conditioning scheme, motivating future work on improved timbre modeling and cross-attention-based speaker conditioning for higher voice-cloning fidelity.

Abstract

This paper introduces DiFlow-TTS, a novel zero-shot text-to-speech (TTS) system that employs discrete flow matching for generative speech modeling. We position this work as an entry point that may facilitate further advances in this research direction. Through extensive empirical evaluation, we analyze both the strengths and limitations of this approach across key aspects, including naturalness, expressive attributes, speaker identity, and inference latency. To this end, we leverage factorized speech representations and design a deterministic Phoneme-Content Mapper for modeling linguistic content, together with a Factorized Discrete Flow Denoiser that jointly models multiple discrete token streams corresponding to prosody and acoustics to capture expressive speech attributes. Experimental results demonstrate that DiFlow-TTS achieves strong performance across multiple metrics while maintaining a compact model size, up to 11.7 times smaller, and enabling low-latency inference that is up to 34 times faster than recent state-of-the-art baselines. Audio samples are available on our demo page: https://diflow-tts.github.io.
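
To make the idea of discrete flow matching over several token streams more concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a mask-based source state, a linear schedule kappa(t) = t, and a placeholder denoiser that returns a uniform distribution; the names `MASK`, `toy_denoiser`, `euler_step`, and `sample`, the stream lengths, and the vocabulary sizes are all hypothetical. Unlike the joint modeling described in the abstract, the placeholder denoiser treats each stream independently.

```python
import random

MASK = -1  # hypothetical mask id used as the source (noise) state


def toy_denoiser(tokens, vocab_size):
    """Stand-in for a learned network (assumption): returns a categorical
    distribution over clean tokens for every position, uniform here."""
    return [[1.0 / vocab_size] * vocab_size for _ in tokens]


def euler_step(tokens, p_unmask, vocab_size, rng):
    """One Euler step of a masked discrete flow.

    Each still-masked position is replaced, with probability p_unmask,
    by a token sampled from the denoiser's predicted distribution.
    Already-unmasked positions are left unchanged.
    """
    probs = toy_denoiser(tokens, vocab_size)
    out = []
    for tok, p in zip(tokens, probs):
        if tok == MASK and rng.random() < p_unmask:
            tok = rng.choices(range(vocab_size), weights=p, k=1)[0]
        out.append(tok)
    return out


def sample(streams, num_steps, seed=0):
    """Generate every stream from an all-mask state in num_steps steps.

    streams maps a stream name to (sequence_length, vocab_size).
    With kappa(t) = t, step size h = 1/num_steps and t = i*h, the unmask
    probability h / (1 - t) simplifies to 1 / (num_steps - i), which is
    exactly 1 at the final step, so no mask survives.
    """
    rng = random.Random(seed)
    state = {name: [MASK] * length for name, (length, _) in streams.items()}
    for i in range(num_steps):
        p_unmask = 1.0 / (num_steps - i)
        for name, (_, vocab_size) in streams.items():
            state[name] = euler_step(state[name], p_unmask, vocab_size, rng)
    return state


# Two hypothetical factorized streams: 8 prosody tokens, 8 acoustic tokens.
tokens = sample({"prosody": (8, 16), "acoustic": (8, 32)}, num_steps=4)
```

In this sketch the number of sampling steps is fixed up front and does not depend on sequence length, which illustrates one way a non-autoregressive sampler can keep latency low. A real system would replace `toy_denoiser` with a trained network conditioned on text and a speaker prompt.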
