SPADE: Structured Pruning and Adaptive Distillation for Efficient LLM-TTS

Tan Dat Nguyen, Jaehun Kim, Ji-Hoon Kim, Shukjae Choi, Youshin Lim, Joon Son Chung

TL;DR

SPADE addresses the efficiency bottleneck of LLM-TTS in two stages. It first prunes non-essential Transformer layers using a Word Error Rate-based layer importance index (WLI), then applies adaptive multi-level distillation to recover autoregressive quality. Pruning removes low-WLI layers, and distillation aligns the student's embeddings, latents, and attention with those of the full teacher. The result is up to 40% fewer parameters, up to 1.7x faster real-time factor, and up to 20% VRAM savings, using less than 5% of the pretraining data. Perceptual metrics on zero-shot benchmarks stay near parity while Transformer depth is halved, which makes real-time, on-device LLM-TTS more practical. These findings demonstrate a practical path to compact, high-fidelity TTS powered by LLMs.

Abstract

The goal of this paper is to introduce SPADE, a framework for Structured Pruning and Adaptive Distillation for Efficient Large Language Model-based text-to-speech (LLM-TTS). Recent LLM-TTS systems achieve strong controllability and zero-shot generalization, but their large parameter counts and high latency limit real-world deployment. SPADE addresses this by combining (i) a pruning step guided by a word-error-rate-based layer importance index to remove non-essential Transformer layers, with (ii) multi-level knowledge distillation to restore autoregressive coherence. On zero-shot benchmarks, SPADE preserves near-parity perceptual quality while halving Transformer depth, reducing VRAM usage by up to 20%, and achieving up to 1.7x faster real-time factor with less than 5% of the original training data. These results show that compact LLM-TTS models can maintain naturalness and speaker similarity while enabling practical real-time speech generation. Audio samples are available at https://mm.kaist.ac.kr/projects/SPADE/.
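To make the multi-level distillation objective more concrete, the following is a minimal sketch, not the authors' implementation. It assumes each level (embeddings, latents, attention) is aligned with a mean-squared-error term and that the terms are combined with fixed weights; the summary above does not specify the loss forms or how the weighting adapts, so the function name, the dictionary layout, and the weights are all hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def multi_level_distillation_loss(student, teacher, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of embedding, latent, and attention alignment terms.

    Assumptions (not from the paper): each level is a flat list of floats,
    each term is an MSE, and the weights are fixed rather than adaptive.
    """
    w_emb, w_lat, w_att = weights
    return (
        w_emb * mse(student["embedding"], teacher["embedding"])
        + w_lat * mse(student["latent"], teacher["latent"])
        + w_att * mse(student["attention"], teacher["attention"])
    )


# Hypothetical toy values: only the latent level differs from the teacher.
teacher = {"embedding": [1.0, 2.0], "latent": [0.5, 0.5], "attention": [0.25, 0.75]}
student = {"embedding": [1.0, 2.0], "latent": [0.5, 1.5], "attention": [0.25, 0.75]}

loss = multi_level_distillation_loss(student, teacher)
print(loss)  # 0.5
```

In this toy case the embedding and attention terms are zero, so the whole loss comes from the latent mismatch. A real setup would compute these terms on tensors from the teacher and student forward passes.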

Paper Structure

This paper contains 10 sections, 2 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Overview of SPADE. A large LLM-TTS model is compressed into a smaller student model through pruning and multi-level distillation. Parameters are copied from retained layers, while latent states are aligned across pruned segments to preserve synthesis quality.
  • Figure 2: WLI and cosine-based layer importance of (a) CosyVoice 2 and (b) LLaSA. A high WLI indicates that WER increases significantly when the layer is removed, and a high cosine-based importance indicates that the input and output latents of the layer are dissimilar. Based on WLI, the layers at the beginning, middle, and end contribute critically to performance. Our method prunes the model by removing the layers that contribute least to performance.
  • Figure 3: SPADE removes half of the Transformer layers, reducing VRAM usage by $14\%$ for CosyVoice 2 and $20\%$ for LLaSA.
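The two layer-importance measures described in the Figure 2 caption can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's code: WLI is taken here to be the WER increase observed when a single layer is removed (the exact formula is not given in this summary), cosine-based importance is taken as one minus the cosine similarity between a layer's input and output latents, and the WER numbers are made up for an 8-layer model. All function names are hypothetical.

```python
import math


def wli_scores(full_wer, wer_without_layer):
    """Assumed WLI: WER increase when each layer is removed on its own."""
    return [w - full_wer for w in wer_without_layer]


def cosine_importance(layer_input, layer_output):
    """1 - cosine similarity between a layer's input and output latents."""
    dot = sum(a * b for a, b in zip(layer_input, layer_output))
    norm_in = math.sqrt(sum(a * a for a in layer_input))
    norm_out = math.sqrt(sum(b * b for b in layer_output))
    return 1.0 - dot / (norm_in * norm_out)


def layers_to_keep(scores, num_keep):
    """Keep the num_keep highest-scoring layers, in their original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:num_keep])


# Hypothetical WER (%) of the full model and with each layer removed in turn.
full_wer = 2.0
wer_without = [9.0, 2.3, 2.1, 6.5, 5.8, 2.2, 2.4, 8.0]

scores = wli_scores(full_wer, wer_without)
kept = layers_to_keep(scores, num_keep=4)  # halve the depth
print(kept)  # [0, 3, 4, 7]
```

With these made-up numbers the retained layers sit at the beginning, middle, and end of the stack, mirroring the pattern the caption reports. The student would then be initialized by copying parameters from the retained layers, as described for Figure 1.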