Flamed-TTS: Flow Matching Attention-Free Models for Efficient Generating and Dynamic Pacing Zero-shot Text-to-Speech

Hieu-Nghia Huynh-Nguyen, Huynh Nguyen Dang, Ngoc-Son Nguyen, Van Nguyen

TL;DR

Flamed-TTS introduces Flow Matching Attention-Free Models for zero-shot TTS, replacing self-attention in the denoiser with a semantically enriched prior to achieve low latency and high fidelity. The method jointly models probabilistic duration and silence generation to enrich temporal diversity, and uses a compact, non-autoregressive architecture based on a ConvNeXt-backed vector field estimator guided by optimal transport flow matching. Experiments on LibriTTS/LibriSpeech show Flamed-TTS achieves strong WER and speaker similarity with substantially lower data requirements and faster inference than state-of-the-art baselines. The work demonstrates that attention is not essential for high-quality zero-shot TTS when a semantically informed prior and probabilistic duration/pacing are leveraged, advancing practical, real-time generation with rich temporal dynamics.

Abstract

Zero-shot Text-to-Speech (TTS) has recently advanced significantly, enabling models to synthesize speech from text using short, limited-context prompts. These prompts serve as voice exemplars, allowing the model to mimic speaker identity, prosody, and other traits without extensive speaker-specific data. Although recent approaches incorporating language models, diffusion, and flow matching have proven their effectiveness in zero-shot TTS, they still encounter challenges such as unreliable synthesis caused by token repetition or unexpected content transfer, along with slow inference and substantial computational overhead. Moreover, temporal diversity, which is crucial for enhancing the naturalness of synthesized speech, remains largely underexplored. To address these challenges, we propose Flamed-TTS, a novel zero-shot TTS framework that emphasizes low computational cost, low latency, and high speech fidelity alongside rich temporal diversity. To achieve this, we reformulate the flow matching training paradigm and incorporate both discrete and continuous representations corresponding to different attributes of speech. Experimental results demonstrate that Flamed-TTS surpasses state-of-the-art models in terms of intelligibility, naturalness, speaker similarity, acoustic characteristics preservation, and dynamic pace. Notably, Flamed-TTS achieves the best WER of 4% compared to the leading zero-shot TTS baselines, while maintaining low latency in inference and high fidelity in generated speech. Code and audio samples are available on our demo page: https://flamed-tts.github.io.
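
To make the flow matching terminology concrete, the following is a minimal sketch of the widely used conditional flow matching objective with a straight-line (optimal transport) path. It is not the authors' reformulated training procedure, whose details are not given in this summary. The names `ot_cfm_pair`, `cfm_loss`, and `sigma_min` are hypothetical, and the source sample `x0` is drawn from a standard Gaussian purely as an assumption; Flamed-TTS is described as using a semantically enriched prior instead.

```python
import numpy as np

def ot_cfm_pair(x0, x1, t, sigma_min=1e-4):
    """Build one training pair for straight-line (OT) flow matching.

    x0: source sample (assumed Gaussian here, NOT the paper's prior)
    x1: target sample (e.g. a latent of real speech)
    t:  time in [0, 1], broadcastable against x0 and x1
    Returns the interpolated point x_t and the target velocity u_t.
    """
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    u_t = x1 - (1.0 - sigma_min) * x0
    return x_t, u_t

def cfm_loss(v_pred, u_t):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((v_pred - u_t) ** 2))

rng = np.random.default_rng(0)
x1 = rng.normal(size=(4, 8))   # hypothetical target latents
x0 = rng.normal(size=(4, 8))   # hypothetical source samples
t = rng.uniform(size=(4, 1))   # one time value per example

x_t, u_t = ot_cfm_pair(x0, x1, t)

# A perfect vector field estimator would predict u_t exactly.
loss_perfect = cfm_loss(u_t, u_t)
```

In training, `v_pred` would come from the vector field estimator evaluated at `(x_t, t)` together with its conditioning; at inference, the learned field is integrated from `t = 0` to `t = 1` with an ODE solver.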

Paper Structure

This paper contains 16 sections, 10 equations, 2 figures, 6 tables, 1 algorithm.

Figures (2)

  • Figure 1: Overview of Flamed-TTS. The input speech prompt is first processed by the Codec Encoder, which produces six latent codes: one for prosody, two for content, and three for acoustic details. These encoded representations are then duplicated based on the durations predicted by the Duration Generator, while the Silence Generator inserts silences after each phoneme. The Code Decoder then generates predicted codes for the text prompt, conditioned on both the encoded phonemes and the latent representation of the reference speech. These predicted codes are converted into embeddings and merged before being passed through the Denoiser, where flow matching is performed. Finally, the output embeddings are fed into the Codec Decoder to synthesize the final speech waveform.
  • Figure 2: Code Decoder architecture. Given the encoded phonemes, the codes are generated one after another by their corresponding FFT Blocks, each conditioned on the prior code of the speech prompt. The synthesized codes are then combined by a CNN module, and the resulting embedding is finally refined by the Denoiser using the flow matching algorithm.
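
The duplication and silence-insertion step described in the Figure 1 caption can be illustrated with a small length-regulation sketch. This is only an illustration under stated assumptions: the function `regulate_length`, the use of a single `silence_vec` for silent frames, and the toy inputs are all hypothetical, and the durations and silence lengths are passed in directly rather than sampled from the Duration Generator and Silence Generator.

```python
import numpy as np

def regulate_length(phoneme_enc, durations, silences, silence_vec):
    """Expand phoneme-level encodings to frame level.

    Each phoneme encoding is repeated `durations[i]` times, then
    `silences[i]` copies of `silence_vec` are appended after it.
    """
    frames = []
    for enc, d, s in zip(phoneme_enc, durations, silences):
        frames.extend([enc] * int(d))          # duplicate the phoneme
        frames.extend([silence_vec] * int(s))  # insert silence after it
    if not frames:
        return np.zeros((0, phoneme_enc.shape[1]), dtype=phoneme_enc.dtype)
    return np.stack(frames)

# Hypothetical toy input: 3 phonemes with 2-dimensional encodings.
phoneme_enc = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
durations = np.array([2, 1, 3])   # frames per phoneme
silences = np.array([0, 2, 0])    # silent frames after each phoneme
silence_vec = np.zeros(2)         # assumed placeholder for silence

frames = regulate_length(phoneme_enc, durations, silences, silence_vec)
print(frames.shape)  # (8, 2)
```

Because the durations and silence lengths are inputs here, sampling different values for the same text yields different pacing, which is the kind of temporal diversity the paper targets.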