SPADE: Structured Pruning and Adaptive Distillation for Efficient LLM-TTS
Tan Dat Nguyen, Jaehun Kim, Ji-Hoon Kim, Shukjae Choi, Youshin Lim, Joon Son Chung
TL;DR
SPADE addresses the efficiency bottleneck of LLM-TTS by pruning non-essential Transformer layers according to a word-error-rate-based layer importance index (WLI), then applying adaptive multi-level distillation to recover autoregressive quality. Pruning removes low-WLI layers, while distillation aligns the student's embeddings, latents, and attention with those of the full teacher. The result is up to 40% fewer parameters, an up to 1.7x faster real-time factor, and up to 20% VRAM savings, using less than 5% of the pretraining data. The approach delivers near-parity perceptual metrics on zero-shot benchmarks while halving depth, enabling real-time, on-device-capable LLM-TTS. These findings demonstrate a practical path to compact, high-fidelity TTS powered by LLMs.
Abstract
The goal of this paper is to introduce SPADE, a framework for Structured Pruning and Adaptive Distillation for Efficient Large Language Model-based text-to-speech (LLM-TTS). Recent LLM-TTS systems achieve strong controllability and zero-shot generalization, but their large parameter counts and high latency limit real-world deployment. SPADE addresses this by combining (i) a pruning step guided by a word-error-rate-based layer importance index to remove non-essential Transformer layers, with (ii) multi-level knowledge distillation to restore autoregressive coherence. On zero-shot benchmarks, SPADE preserves near-parity perceptual quality while halving Transformer depth, reducing VRAM usage by up to 20%, and achieving up to 1.7x faster real-time factor with less than 5% of the original training data. These results show that compact LLM-TTS models can maintain naturalness and speaker similarity while enabling practical real-time speech generation. Audio samples are available at https://mm.kaist.ac.kr/projects/SPADE/.
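The pruning step described above can be illustrated with a minimal sketch. The abstract does not give the exact definition of the layer importance index, so this example assumes one plausible form: the increase in WER observed when a single layer is skipped. The names `layer_importance`, `select_layers`, and `toy_wer` are hypothetical, and `toy_wer` is a stand-in for a real evaluation that would synthesize speech, transcribe it with an ASR model, and compute WER.

```python
from typing import Callable, List, Sequence


def layer_importance(
    num_layers: int, wer_fn: Callable[[Sequence[int]], float]
) -> List[float]:
    """Score each layer by the WER increase seen when that layer alone is skipped.

    Assumption: this leave-one-out definition is illustrative only and is not
    taken from the paper.
    """
    all_layers = list(range(num_layers))
    baseline = wer_fn(all_layers)
    scores = []
    for i in all_layers:
        kept = [j for j in all_layers if j != i]
        scores.append(wer_fn(kept) - baseline)
    return scores


def select_layers(scores: Sequence[float], keep: int) -> List[int]:
    """Keep the `keep` highest-scoring layers, returned in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


def toy_wer(kept: Sequence[int]) -> float:
    """Hypothetical evaluator: each missing layer adds a fixed WER penalty."""
    penalty = {0: 0.30, 1: 0.02, 2: 0.01, 3: 0.15, 4: 0.03, 5: 0.25}
    return 0.05 + sum(p for i, p in penalty.items() if i not in kept)


scores = layer_importance(6, toy_wer)
kept_layers = select_layers(scores, keep=3)  # halve the depth: 6 -> 3 layers
print(kept_layers)  # [0, 3, 5]
```

In this toy setting, layers 1, 2, and 4 contribute least to intelligibility and are removed. In a real system the pruned model would then be distilled from the full teacher to recover quality.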
