FLY-TTS: Fast, Lightweight and High-Quality End-to-End Text-to-Speech Synthesis

Yinlin Guo, Yening Lv, Jinqiao Dou, Yan Zhang, Yuehai Wang

TL;DR

FLY-TTS targets the deployment of high-quality end-to-end TTS on resource-constrained devices. It builds on VITS with three changes: a ConvNeXt-based decoder that predicts Fourier spectral coefficients and synthesizes the waveform with a fast inverse short-time Fourier transform (iSTFT); grouped parameter-sharing in both the text encoder and the flow-based prior; and a pre-trained WavLM discriminator that provides adversarial guidance during training without increasing inference cost. The model reaches a real-time factor of 0.0139 on an Intel Core i9 CPU, 8.8x faster than the baseline (0.1221), with 1.6x parameter compression, while keeping MOS and intelligibility close to a strong baseline on LJSpeech. These properties make it a candidate for edge devices, with potential for extension to multi-speaker and style-aware synthesis.

Abstract

While recent advances in text-to-speech (TTS) synthesis have yielded remarkable improvements in generating high-quality speech, research on lightweight and fast models remains limited. This paper introduces FLY-TTS, a new fast, lightweight and high-quality speech synthesis system based on VITS. Specifically, 1) we replace the decoder with ConvNeXt blocks that generate Fourier spectral coefficients, followed by the inverse short-time Fourier transform to synthesize waveforms; 2) to compress the model size, we introduce a grouped parameter-sharing mechanism to the text encoder and flow-based model; 3) we further employ the large pre-trained WavLM model for adversarial training to improve synthesis quality. Experimental results show that our model achieves a real-time factor of 0.0139 on an Intel Core i9 CPU, 8.8x faster than the baseline (0.1221), with a 1.6x parameter compression. Objective and subjective evaluations indicate that FLY-TTS exhibits speech quality comparable to the strong baseline.
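
The speed figures in the abstract can be checked with a few lines of arithmetic. The sketch below assumes the usual definition of real-time factor (seconds of compute per second of generated audio), which the abstract does not spell out. The function names and the 10-second utterance are hypothetical and chosen only for illustration.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """Return RTF: seconds of compute per second of generated audio.

    Assumption: this is the common definition; the abstract does not state one.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def speedup(baseline_rtf, model_rtf):
    """Return how many times faster the model is than the baseline."""
    return baseline_rtf / model_rtf


# RTF values reported in the abstract (Intel Core i9 CPU).
BASELINE_RTF = 0.1221
FLY_TTS_RTF = 0.0139

ratio = speedup(BASELINE_RTF, FLY_TTS_RTF)

# Hypothetical 10-second utterance, chosen only for illustration.
utterance_seconds = 10.0
baseline_compute = BASELINE_RTF * utterance_seconds
fly_tts_compute = FLY_TTS_RTF * utterance_seconds

print(f"speedup: {ratio:.1f}x")
print(f"baseline compute: {baseline_compute:.3f} s")
print(f"FLY-TTS compute: {fly_tts_compute:.3f} s")
```

The ratio 0.1221 / 0.0139 rounds to 8.8, which matches the reported speedup. An RTF below 1 means synthesis finishes faster than the audio plays back.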

Paper Structure

This paper contains 16 sections, 4 equations, 1 figure, and 3 tables.

Figures (1)

  • Figure 1: Our proposed FLY-TTS. (a): The overall architecture of FLY-TTS. (b): Text encoder (top half) and flow-based model (bottom half) with grouped parameter-sharing. (c): ConvNeXt-based decoder. (d): Large pre-trained WavLM model for adversarial training.
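
The grouped parameter-sharing shown in panel (b) can be illustrated with a toy sketch. It assumes that a stack of layers is split into contiguous groups and that every layer in a group reuses the same parameter set. The exact grouping scheme is not given in this summary, so the assignment rule, the function names, and all sizes below are hypothetical rather than the authors' configuration.

```python
def build_grouped_stack(num_layers, num_groups, params_per_layer):
    """Return one parameter list per layer, shared within each group.

    Assumption: groups are contiguous, so the first
    num_layers // num_groups layers form group 0, and so on.
    """
    if num_groups <= 0 or num_layers % num_groups != 0:
        raise ValueError("num_groups must be positive and divide num_layers")
    layers_per_group = num_layers // num_groups
    # One independent parameter set per group (zeros stand in for weights).
    group_params = [[0.0] * params_per_layer for _ in range(num_groups)]
    # Each layer holds a reference to its group's parameters, not a copy.
    return [group_params[i // layers_per_group] for i in range(num_layers)]


def count_unique_params(stack):
    """Count parameters once per distinct underlying parameter set."""
    unique_sets = {id(params): len(params) for params in stack}
    return sum(unique_sets.values())


# Hypothetical sizes for illustration only; not the paper's configuration.
NUM_LAYERS, NUM_GROUPS, PARAMS_PER_LAYER = 6, 2, 1000

stack = build_grouped_stack(NUM_LAYERS, NUM_GROUPS, PARAMS_PER_LAYER)
shared_total = count_unique_params(stack)
unshared_total = NUM_LAYERS * PARAMS_PER_LAYER

print(f"unshared: {unshared_total}, shared: {shared_total}")
print(f"compression: {unshared_total / shared_total:.1f}x")
```

In this toy setup the stack still applies six layers, so depth and compute are unchanged and only the stored parameters shrink. The ratio printed here applies to the toy stack alone and is unrelated to the 1.6x whole-model compression reported for FLY-TTS.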