Fast and High-Quality Auto-Regressive Speech Synthesis via Speculative Decoding
Bohan Li, Hankun Wang, Situo Zhang, Yiwei Guo, Kai Yu
TL;DR
Auto-regressive TTS suffers from slow inference because speech-token sequences are long. We propose VADUSA, which combines MEDUSA-based speculative decoding with a tolerance mechanism and a TTS-oriented sparse tree to speed up decoding while maintaining high synthesis quality. On LibriTTS and LibriHeavy, it delivers speedups of up to $\sim 4\times$ while keeping WER and MOS predictions competitive or improved for both semantic and acoustic tokens. The method generalizes across speech token types and scales to large datasets, offering a practical route to fast, robust AR TTS.
Abstract
Auto-regressive architectures, such as GPT, are widely used in modern Text-to-Speech (TTS) systems. However, they incur substantial inference time, largely because speech tokens form lengthy sequences that make next-token prediction challenging. In this work, we introduce VADUSA, one of the first approaches to accelerate auto-regressive TTS through speculative decoding. Our results show that VADUSA not only significantly improves inference speed but also enhances performance by incorporating draft heads that predict future speech content auto-regressively. Furthermore, a tolerance mechanism applied during sampling accelerates inference without compromising quality. Our approach generalizes well across large datasets and various types of speech tokens.
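To make the idea of tolerant verification concrete, the following is a minimal sketch and not the authors' implementation. It assumes a simple acceptance rule (a draft token passes if the base model assigns it at least `threshold` probability) and treats the tolerance as a budget of failing tokens that may still be kept. The function name `accept_prefix` and the parameters `threshold` and `tolerance` are hypothetical, chosen only for illustration.

```python
from typing import Sequence


def accept_prefix(
    draft_tokens: Sequence[int],
    base_probs: Sequence[Sequence[float]],
    threshold: float = 0.1,
    tolerance: int = 0,
) -> int:
    """Return how many leading draft tokens are kept.

    base_probs[i][t] is the base model's probability of token t at
    draft position i. A token passes if that probability is at least
    `threshold` (an assumed rule). Up to `tolerance` failing tokens are
    kept anyway; verification stops at the first failure beyond that.
    """
    accepted = 0
    failures = 0
    for tok, probs in zip(draft_tokens, base_probs):
        if probs[tok] < threshold:
            failures += 1
            if failures > tolerance:
                break  # tolerance budget exhausted: reject this token and the rest
        accepted += 1
    return accepted


# Toy example: four draft tokens over a three-token vocabulary.
draft = [2, 0, 1, 1]
probs = [
    [0.10, 0.10, 0.80],  # token 2 passes
    [0.05, 0.90, 0.05],  # token 0 fails
    [0.20, 0.70, 0.10],  # token 1 passes
    [0.90, 0.05, 0.05],  # token 1 fails
]
strict = accept_prefix(draft, probs, tolerance=0)
tolerant = accept_prefix(draft, probs, tolerance=1)
```

In this toy setting, a larger tolerance can only lengthen the accepted prefix, which is the intuition behind accepting more tokens per decoding step. Whether quality is preserved depends on the actual acceptance criterion and on the model, which this sketch does not capture.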
