Task-Agnostic Structured Pruning of Speech Representation Models

Haoyu Wang; Siyuan Wang; Wei-Qiang Zhang; Hongbin Suo; Yulong Wan

TL;DR

This work tackles the memory and compute barriers of large self-supervised speech models by introducing a task-agnostic pruning framework that combines fine-grained attention head pruning with STE-enhanced $L_0$ regularization in a multi-scale structured pruning scheme. The method preserves content-rich representations through selective pruning of attention dimensions and coarse structures, while a distillation-based objective and Lagrangian sparsity control maintain downstream performance. Empirical results on the SUPERB benchmark show the approach matches or exceeds dense-model performance on multiple tasks, surpasses distilled baselines, and achieves about 72% fewer parameters with roughly 2× faster inference. The work offers a practical path to deploy high-performing speech representations on resource-constrained devices without task-specific retraining.

Abstract

Self-supervised pre-trained models such as wav2vec 2.0, HuBERT, and WavLM have been shown to significantly improve many speech tasks. However, their large memory footprint and heavy computational requirements hinder their industrial applicability. Structured pruning is a hardware-friendly model compression technique, but it usually incurs a larger accuracy loss than unstructured pruning. In this paper, we propose a fine-grained attention head pruning method to compensate for the performance degradation. We also introduce the straight-through estimator (STE) into $L_0$ regularization to further accelerate the pruned model. Experiments on the SUPERB benchmark show that our model achieves performance comparable to the dense model on multiple tasks and outperforms the wav2vec 2.0 base model on average, with 72% fewer parameters and 2 times faster inference.

Paper Structure

This paper contains 14 sections, 6 equations, 3 figures, and 3 tables.

Introduction
Backgrounds
  Pre-trained Speech Representation Models
  Pruning Based on the $L_0$ Regularization
  Multi-scale Structured Pruning
Methods
  Fine-grained Attention Head Pruning
  Optimizing Pruning Masks with STE
  Training Objective
Experiments
  SUPERB
  Pruning setup
  Results
Conclusion

Figures (3)

Figure 1: (a) The probability distribution of $z$ and $\bar{s}$. (b) $z$ and $\bar{s}$ as a function of $\log\alpha$, averaged over 500 samples. $z$ can be exactly 0, exactly 1, or any value in between. In the shaded region, $\partial z/\partial \bar{s}=0$.
Figure 2: The relationship between the SUPERB score and the number of parameters.
Figure 3: The effectiveness of STE.
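
The Figure 1 caption notes that the gate $z$ has zero gradient with respect to $\bar{s}$ whenever it sits at exactly 0 or 1. The following is a minimal sketch of why that happens and what a straight-through estimator changes. It assumes the hard concrete parameterization commonly used for $L_0$ regularization, which this page does not spell out; the constants `GAMMA`, `ZETA`, and `BETA` and all function names are illustrative choices, not values or code from the paper.

```python
import math
import random

# Stretch limits and temperature: common defaults from the hard concrete
# literature, assumed here rather than taken from this paper.
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0


def sample_gate(log_alpha, rng):
    """Draw one (s_bar, z) pair for a gate with location parameter log_alpha."""
    u = min(max(rng.random(), 1e-6), 1.0 - 1e-6)  # keep log() finite
    logit = (math.log(u) - math.log(1.0 - u) + log_alpha) / BETA
    s = 1.0 / (1.0 + math.exp(-logit))
    s_bar = s * (ZETA - GAMMA) + GAMMA  # stretched to (GAMMA, ZETA)
    z = min(1.0, max(0.0, s_bar))       # clipped to [0, 1]
    return s_bar, z


def clamp_grad(s_bar):
    """Exact dz/ds_bar: zero wherever the clamp is active."""
    return 1.0 if 0.0 < s_bar < 1.0 else 0.0


def ste_grad(s_bar):
    """Straight-through surrogate: treat the clamp as the identity."""
    return 1.0


rng = random.Random(0)
samples = [sample_gate(0.0, rng) for _ in range(500)]
blocked = sum(1 for s_bar, _ in samples if clamp_grad(s_bar) == 0.0)
print(f"{blocked} of 500 samples have zero exact gradient")
```

Under these assumptions, every sample that lands in the clipped region contributes no learning signal to $\log\alpha$ through the exact gradient, while the straight-through surrogate still passes one. How the paper applies STE in detail may differ from this simplified picture.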
