Fast and High-Quality Auto-Regressive Speech Synthesis via Speculative Decoding

Bohan Li, Hankun Wang, Situo Zhang, Yiwei Guo, Kai Yu

TL;DR

Auto-regressive TTS is slow at inference time because speech-token sequences are long. We propose VADUSA, which combines MEDUSA-based speculative decoding with a tolerance mechanism and a TTS-oriented sparse tree to decode quickly while maintaining high synthesis quality. On LibriTTS and LibriHeavy, it delivers speedups of up to $\sim 4\times$ while keeping WER and predicted MOS competitive or improved, for both semantic and acoustic tokens. The method generalizes across speech token types and scales to large datasets, offering a practical route to robust, fast AR TTS.

Abstract

Auto-regressive architectures, such as GPTs, are widely used in modern Text-to-Speech (TTS) systems. However, they incur substantial inference time, particularly because lengthy sequences of speech tokens make next-token prediction challenging. In this work, we introduce VADUSA, one of the first approaches to accelerate auto-regressive TTS through speculative decoding. Our results show that VADUSA not only significantly improves inference speed but also enhances performance by incorporating draft heads that predict future speech content auto-regressively. Furthermore, a tolerance mechanism applied during sampling accelerates inference without compromising quality. Our approach generalizes well across large datasets and various types of speech tokens.
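The abstract does not spell out how the tolerance mechanism works, so the following is only a minimal sketch of one plausible reading: during draft verification, a candidate sequence is accepted token by token, and up to `tolerance` draft tokens that disagree with the base model are kept rather than ending the accepted prefix. The function name `accepted_prefix_length`, the `tolerance` parameter, and the exact-match acceptance test are all assumptions made for illustration; they are not taken from the paper, which applies its mechanism during sampling rather than greedy matching.

```python
from typing import Sequence


def accepted_prefix_length(
    draft: Sequence[int],
    base: Sequence[int],
    tolerance: int = 0,
) -> int:
    """Return how many leading draft tokens are kept.

    draft[i] is the token proposed by the draft heads at position i.
    base[i] is the base model's own choice at position i, computed in
    one parallel pass conditioned on the draft tokens before position i.
    Hypothetical rule: a mismatch is kept as long as the running number
    of mismatches does not exceed `tolerance`.
    """
    mismatches = 0
    accepted = 0
    for draft_token, base_token in zip(draft, base):
        if draft_token != base_token:
            mismatches += 1
            if mismatches > tolerance:
                break  # budget exhausted: stop before this token
        accepted += 1
    return accepted


if __name__ == "__main__":
    draft_tokens = [5, 7, 9, 2]
    base_tokens = [5, 7, 1, 2]
    # Strict verification stops at the first disagreement.
    print(accepted_prefix_length(draft_tokens, base_tokens, tolerance=0))
    # A budget of one mismatch lets the whole candidate through.
    print(accepted_prefix_length(draft_tokens, base_tokens, tolerance=1))
```

With `tolerance=0` this reduces to ordinary strict prefix verification. A larger budget accepts more tokens per decoding step, which is where a speedup would come from under this reading, at the cost of keeping some tokens the base model would not have chosen.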

Paper Structure

This paper contains 17 sections, 3 figures, 3 tables, 1 algorithm.

Figures (3)

  • Figure 1: Overview of the vanilla VADUSA framework.
  • Figure 2: Performance of different VADUSA configurations, all using the same base model trained on HuBERT+k-means 2048 tokens.
  • Figure 3: Speedup and mean number of accepted tokens as the number set in the tolerance strategy increases, with the base model trained on HuBERT+k-means 2048 tokens and 64 candidates selected per decoding step.