Task-Agnostic Structured Pruning of Speech Representation Models
Haoyu Wang, Siyuan Wang, Wei-Qiang Zhang, Hongbin Suo, Yulong Wan
TL;DR
This work tackles the memory and compute barriers of large self-supervised speech models with a task-agnostic, multi-scale structured pruning framework that combines fine-grained attention head pruning with $L_0$ regularization enhanced by the straight-through estimator (STE). The method preserves content-rich representations by pruning individual attention dimensions alongside coarser structures, while a distillation-based objective and Lagrangian sparsity control maintain downstream performance. On the SUPERB benchmark, the pruned model is comparable to the dense model on multiple tasks, outperforms the Wav2vec 2.0 base model on average, and surpasses distilled baselines, with 72% fewer parameters and roughly 2× faster inference. The work offers a practical path to deploying high-performing speech representations on resource-constrained devices without task-specific retraining.
Abstract
Self-supervised pre-trained models such as Wav2vec 2.0, HuBERT, and WavLM have been shown to significantly improve many speech tasks. However, their large memory footprint and heavy computational requirements hinder their industrial applicability. Structured pruning is a hardware-friendly model compression technique, but it usually incurs a larger accuracy loss than unstructured pruning. In this paper, we propose a fine-grained attention head pruning method to compensate for this performance degradation. We also introduce the straight-through estimator (STE) into $L_0$ regularization to further accelerate the pruned model. Experiments on the SUPERB benchmark show that our model achieves performance comparable to the dense model on multiple tasks and outperforms the Wav2vec 2.0 base model on average, with 72% fewer parameters and 2 times faster inference.
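As a rough illustration of what combining $L_0$ regularization with a straight-through estimator can look like, the sketch below uses the hard concrete gate commonly paired with $L_0$ regularization. It is not taken from this paper: the stretch interval, the temperature, the 0.5 threshold, the point at which the gate is binarized, and all function names are assumptions made for the example.

```python
import math

# Assumed defaults often used with hard concrete gates; not values from this paper.
LEFT, RIGHT, BETA = -0.1, 1.1, 2.0 / 3.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hard_concrete(log_alpha, u):
    """Relaxed gate in [0, 1] for one prunable structure, given noise u in (0, 1)."""
    s = sigmoid((math.log(u) - math.log(1.0 - u) + log_alpha) / BETA)
    # Stretch to (LEFT, RIGHT), then clip so exact 0 and 1 are reachable.
    return min(1.0, max(0.0, s * (RIGHT - LEFT) + LEFT))


def expected_l0(log_alphas):
    """Expected number of non-zero gates, usable as a differentiable sparsity penalty."""
    shift = BETA * math.log(-LEFT / RIGHT)
    return sum(sigmoid(a - shift) for a in log_alphas)


def ste_gate(z, threshold=0.5):
    """Hypothetical STE step: binary mask in the forward pass,
    with the mask's derivative w.r.t. z treated as 1 in the backward pass."""
    mask = 1.0 if z > threshold else 0.0
    surrogate_grad = 1.0
    return mask, surrogate_grad


z = hard_concrete(log_alpha=2.0, u=0.5)  # relaxed gate value
mask, grad = ste_gate(z)                 # binarized gate with pass-through gradient
```

Under these assumptions, binarizing in the forward pass means a structure is either fully kept or fully removed during training, while the pass-through gradient still lets the gate parameter `log_alpha` be learned.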
