Incremental FastPitch: Chunk-based High Quality Text to Speech

Muyang Du, Chuan Liu, Junjie Lai

TL;DR

Real-time streaming TTS requires incremental synthesis with low latency. Incremental FastPitch introduces a chunk-based FFT decoder with fixed-size past states and receptive-field constrained training to enable chunk-wise Mel generation while preserving parallelism. The work provides design details, analyzes receptive field effects, and compares static versus dynamic masking, showing speech quality close to the parallel baseline with substantially lower latency (approximately 22× real-time). These findings offer a practical approach for low-latency TTS in streaming applications on GPUs, enabling faster, more responsive voice synthesis without sacrificing quality.

Abstract

Parallel text-to-speech models have been widely applied for real-time speech synthesis, and they offer more controllability and a much faster synthesis process than conventional auto-regressive models. Despite these benefits, parallel models are naturally unfit for incremental synthesis because of their fully parallel architectures, such as the transformer. In this work, we propose Incremental FastPitch, a novel FastPitch variant capable of incrementally producing high-quality Mel chunks by improving the architecture with chunk-based FFT blocks, training with receptive-field constrained chunk attention masks, and running inference with fixed-size past model states. Experimental results show that our proposal can produce speech quality comparable to the parallel FastPitch, with significantly lower latency that allows even lower response times for real-time speech applications.
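To make the idea of a receptive-field constrained chunk attention mask concrete, here is a minimal sketch in plain Python. It assumes that each frame may attend to every frame in its own chunk plus a fixed number of frames immediately before that chunk. The function name `chunk_attention_mask` and its parameters are hypothetical and are not taken from the paper's implementation. The sketch covers a single attention layer only.

```python
def chunk_attention_mask(seq_len, chunk_size, past_size):
    """Build a seq_len x seq_len boolean matrix.

    mask[i][j] is True when query frame i may attend to key frame j.
    Assumption (illustrative, not the authors' code): a frame sees its
    whole chunk plus the past_size frames just before that chunk.
    """
    if chunk_size <= 0 or past_size < 0:
        raise ValueError("chunk_size must be positive and past_size non-negative")
    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        # First frame of the chunk that contains frame i.
        chunk_start = (i // chunk_size) * chunk_size
        # One past the last frame of that chunk, clipped to the sequence.
        chunk_end = min(chunk_start + chunk_size, seq_len)
        # Fixed-size window of past frames, clipped at the sequence start.
        visible_start = max(0, chunk_start - past_size)
        for j in range(visible_start, chunk_end):
            mask[i][j] = True
    return mask


if __name__ == "__main__":
    # Print the mask for six frames, chunks of two, and two past frames.
    for row in chunk_attention_mask(seq_len=6, chunk_size=2, past_size=2):
        print("".join("x" if allowed else "." for allowed in row))
```

With a mask of this shape, no frame depends on future chunks, and the amount of past context is bounded by `past_size`. That matches the fixed-size past state described for inference.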

Paper Structure (13 sections, 3 equations, 4 figures, 1 table)

Figures (4)

  • Figure 1: Incremental FastPitch, Chunk-based FFT Block, and Chunk Mask for Receptive-Field Constrained Training
  • Figure 2: Chunk-based decoder receptive field visualization.
  • Figure 3: MSD between the parallel FastPitch and Incremental FastPitch models trained with different types of masks and then run at inference with different chunk and past sizes. Each bar represents a specific (chunk size, past size) pair used for inference. The horizontal axis gives the (chunk size, past size) pair used for training. A. Static Mask. B. Dynamic Mask.
  • Figure 4: Mel-spectrogram Visualization.