SNIPER Training: Single-Shot Sparse Training for Text-to-Speech

Perry Lam, Huayun Zhang, Nancy F. Chen, Berrak Sisman, Dorien Herremans

TL;DR

SNIPER training introduces a decreasing-sparsity regime for text-to-speech models: training starts at high sparsity to accelerate early progress, and the sparsity rate is then gradually reduced to zero to improve final performance. The method uses SNIP to select pruning masks from gradient information and changes the sparsity rate as training proceeds. Applied to FastSpeech2 on LJSpeech, SNIPER reaches better losses in the first few epochs, and its final models outperform constant-sparsity baselines and match or slightly exceed dense models, with a negligible difference in training time. In naturalness and intelligibility evaluations, SNIPER samples are often preferred and show modest WER improvements. The approach is hardware-agnostic and offers a training-time-efficient way to use sparsity when building high-quality TTS models.

Abstract

Text-to-speech (TTS) models have achieved remarkable naturalness in recent years, yet like most deep neural models, they have more parameters than necessary. Sparse TTS models can improve on dense models via pruning and extra retraining, or converge faster than dense models with some performance loss. Thus, we propose training TTS models using decaying sparsity, i.e. a high initial sparsity to accelerate training first, followed by a progressive rate reduction to obtain better eventual performance. This decremental approach differs from current methods of incrementing sparsity to a desired target, which costs significantly more time than dense training. We call our method SNIPER training: Single-shot Initialization Pruning Evolving-Rate training. Our experiments on FastSpeech2 show that we were able to obtain better losses in the first few training epochs with SNIPER, and that the final SNIPER-trained models outperformed constant-sparsity models and edged out dense models, with negligible difference in training time.
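
The decaying-sparsity idea can be illustrated with a minimal sketch. It assumes the standard SNIP saliency score |w · g| (weight times loss gradient) and a stepwise schedule. The function names, the epoch boundaries, and the choice to recompute the mask whenever the rate changes are illustrative assumptions, not details taken from the paper.

```python
from typing import List, Sequence, Tuple


def snip_mask(weights: Sequence[float], grads: Sequence[float],
              sparsity: float) -> List[int]:
    """Return a 0/1 mask keeping the top (1 - sparsity) fraction of weights.

    Weights are ranked by the SNIP-style saliency |w * g|. SNIP also
    normalizes the scores by their sum, which does not change the ranking,
    so the normalization is omitted here.
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    n = len(weights)
    n_keep = n - int(round(sparsity * n))
    saliency = [abs(w * g) for w, g in zip(weights, grads)]
    # Indices sorted from most to least salient.
    order = sorted(range(n), key=lambda i: saliency[i], reverse=True)
    mask = [0] * n
    for i in order[:n_keep]:
        mask[i] = 1
    return mask


def sparsity_at(epoch: int, schedule: Sequence[Tuple[int, float]]) -> float:
    """Look up the sparsity rate for an epoch.

    `schedule` is a list of (start_epoch, sparsity) pairs sorted by
    start_epoch; the rate of the latest pair whose start has been reached
    applies.
    """
    current = schedule[0][1]
    for start, rate in schedule:
        if epoch >= start:
            current = rate
    return current


# Hypothetical schedule: 40% sparsity, then 20%, then dense.
# The epoch boundaries (0, 5, 10) are placeholders, not values from the paper.
SCHEDULE = [(0, 0.4), (5, 0.2), (10, 0.0)]

# Toy weights and gradients standing in for one layer's parameters.
weights = [0.5, -1.0, 0.1, 2.0, -0.3]
grads = [0.4, 0.5, 3.0, 0.05, -0.1]

for epoch in (0, 5, 10):
    rate = sparsity_at(epoch, SCHEDULE)
    print(epoch, rate, snip_mask(weights, grads, rate))
```

In a real training loop the mask would be applied elementwise to the weights (and typically to their gradients) of each prunable layer. The sketch only shows how the number of active connections grows as the scheduled sparsity decays.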

Paper Structure

This paper contains 16 sections, 1 equation, 2 figures, 4 tables, 1 algorithm.

Figures (2)

  • Figure 1: SNIPER example with 40% sparsity reducing to 20%.
  • Figure 2: WER weighted by ground truth length.