ECTSpeech: Enhancing Efficient Speech Synthesis via Easy Consistency Tuning

Tao Zhu, Yinfeng Yu, Liejun Wang, Fuchun Sun, Wendong Zheng

TL;DR

ECTSpeech tackles the inefficiency of diffusion-based TTS by applying Easy Consistency Tuning (ECT) to enable high-quality one-step speech generation without distillation. It introduces MSGate, a multi-scale gate module that improves multi-scale feature fusion in the denoiser, and trains in two stages: EDM pretraining followed by consistency tuning. On LJSpeech, ECTSpeech achieves speech quality competitive with state-of-the-art methods while reducing training cost and enabling one-step inference. The approach broadens the applicability of diffusion-based TTS by lowering training complexity and making deployment more efficient.

Abstract

Diffusion models have demonstrated remarkable performance in speech synthesis, but typically require multi-step sampling, resulting in low inference efficiency. Recent studies address this issue by distilling diffusion models into consistency models, enabling efficient one-step generation. However, these approaches introduce additional training costs and rely heavily on the performance of pre-trained teacher models. In this paper, we propose ECTSpeech, a simple and effective one-step speech synthesis framework that, for the first time, incorporates the Easy Consistency Tuning (ECT) strategy into speech synthesis. By progressively tightening consistency constraints on a pre-trained diffusion model, ECTSpeech achieves high-quality one-step generation while significantly reducing training complexity. In addition, we design a multi-scale gate module (MSGate) to enhance the denoiser's ability to fuse features at different scales. Experimental results on the LJSpeech dataset demonstrate that ECTSpeech achieves audio quality comparable to state-of-the-art methods under single-step sampling, while substantially reducing the model's training cost and complexity.
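
The phrase "progressively tightening consistency constraints" can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes the schedule loosely follows the general ECT idea: pair a noise level t with a smaller level r, then shrink the gap between them as training proceeds. The function name `tightened_pair` and the parameters `q` and `d` are illustrative assumptions, not values taken from this paper.

```python
def tightened_pair(t: float, step: int, q: float = 2.0, d: int = 1000) -> tuple[float, float]:
    """Return a (t, r) noise-level pair for a consistency-style loss.

    Hypothetical schedule (an assumption, not the paper's exact rule):
    every `d` training steps the ratio r/t moves closer to 1 by a
    factor of `q`, so the two noise levels being compared get closer.
    """
    stage = step // d                  # how many times the gap has shrunk so far
    ratio = 1.0 - 1.0 / (q ** stage)   # 0 at stage 0, approaches 1 as stage grows
    r = t * ratio                      # smaller noise level paired with t
    return t, r


if __name__ == "__main__":
    # Print how the pair evolves for a fixed noise level t = 4.0.
    for step in (0, 1000, 2000, 3000):
        print(step, tightened_pair(4.0, step))
```

Under these assumptions, step 0 yields the pair (t, 0), which corresponds to an ordinary denoising target, so tuning can start from the pre-trained diffusion model. Later stages move r toward t, which is the sense in which the constraint tightens.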

Paper Structure

This paper contains 15 sections, 15 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Training cost and synthesis quality of different methods on the LJSpeech dataset. The horizontal axis denotes different methods (Grad-TTS, CoMoSpeech, ECTSpeech) and their training pipelines. The left vertical axis indicates the number of training steps (in millions), and the right vertical axis shows MOS. NFE (number of function evaluations) is the number of sampling steps used for inference in each method.
  • Figure 2: Overview of the proposed ECTSpeech framework. (a) System overview, where the dashed arrow indicates the consistency tuning process during fine-tuning. (b) The UNet decoder with multi-scale gate modules (MSGate) applied to skip connections. (c) Details of the MSGate module, illustrating the multi-branch fusion and gating mechanism.
  • Figure 3: Mel-spectrograms for qualitative comparison: (a) Ground Truth; (b) ECTSpeech (Pre-trained, 1 step); (c) ECTSpeech (Pre-trained, 50 steps); (d) ECTSpeech (1 step).