Exploring Effective Distillation of Self-Supervised Speech Models for Automatic Speech Recognition

Yujin Wang; Changli Tang; Ziyang Ma; Zhisheng Zheng; Xie Chen; Wei-Qiang Zhang

TL;DR

This work tackles the practical challenge of deploying self-supervised speech models by distilling HuBERT into compact student models for ASR. It systematically compares deep-and-thin (D&T) against shallow-and-wide (S&W) student architectures under unconstrained fine-tuning, and introduces a discriminative loss $L_{disc}$ to complement the traditional regression loss $L_{reg}$, yielding gains in low-resource ASR. It also proposes distilling the front-end from raw waveform input to an Fbank-based pipeline, achieving roughly 17% parameter reduction and about 2× faster inference with only modest degradation, aided by front-end adaptation. Together, the architecture choice, discriminative distillation, and front-end distillation make SSL-based ASR cheaper to deploy in resource-constrained settings; the claims are supported by ablations and an analysis of representation similarity.

Abstract

Recent years have witnessed great strides in self-supervised learning (SSL) for speech processing. An SSL model is normally pre-trained on a wide variety of unlabelled data, and a large model size is preferred to increase the modeling capacity. However, the computation and memory costs of such an oversized model can limit its potential applications. Miniaturization of SSL models has therefore become an important research direction of practical value. To this end, we explore the effective distillation of HuBERT-based SSL models for automatic speech recognition (ASR). First, to establish a strong baseline, we conduct a comprehensive study of different student model structures. On top of this, as a supplement to the regression loss widely adopted in previous works, we introduce a discriminative loss for HuBERT to enhance the distillation performance, especially in low-resource scenarios. In addition, we design a simple and effective algorithm to distill the front-end input from waveform to Fbank features, resulting in a 17% parameter reduction and a doubling of inference speed with marginal performance degradation.
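The abstract names two distillation objectives without giving their formulas, so the following is only a minimal NumPy sketch of how a regression term and a discriminative term might be combined. Everything specific here is an assumption and not taken from the paper: the regression term uses an L1-plus-cosine form common in earlier distillation work, the discriminative term is a plain cross-entropy against frame-level target cluster indices, and the names `regression_loss`, `discriminative_loss`, `distillation_loss`, `lam`, and `alpha` are hypothetical.

```python
import numpy as np


def regression_loss(student, teacher, lam=1.0):
    """Assumed regression term: per-frame L1 distance minus lam * log-sigmoid
    of the cosine similarity, averaged over frames.

    student, teacher: arrays of shape (frames, dim).
    """
    l1 = np.abs(student - teacher).mean(axis=-1)
    norms = np.linalg.norm(student, axis=-1) * np.linalg.norm(teacher, axis=-1)
    cos = (student * teacher).sum(axis=-1) / (norms + 1e-8)
    log_sigmoid = -np.log1p(np.exp(-cos))
    return float((l1 - lam * log_sigmoid).mean())


def discriminative_loss(student_logits, target_ids):
    """Assumed discriminative term: cross-entropy between the student's
    per-frame logits over discrete units and target unit indices.

    student_logits: array of shape (frames, num_units).
    target_ids: integer array of shape (frames,).
    """
    shifted = student_logits - student_logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    picked = log_probs[np.arange(len(target_ids)), target_ids]
    return float(-picked.mean())


def distillation_loss(student, teacher, student_logits, target_ids, alpha=1.0):
    """Hypothetical combination: L_reg + alpha * L_disc."""
    return regression_loss(student, teacher) + alpha * discriminative_loss(
        student_logits, target_ids
    )
```

In this sketch the weight `alpha` controls how strongly the discriminative term contributes; how the two terms are actually weighted, and where the targets come from, is not stated in the text above.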

Paper Structure (16 sections, 5 equations, 3 figures, 7 tables)

Introduction
Related Works
HuBERT
Distillation of SSL Models
Methods
Exploration of Student Model Structures
Discriminative Loss for SSL Model Distillation
Distillation of Front-end Features
Experiments
Experimental Setup
Results
Details of Replacing Front-end
About the Teacher Model
Feature Analysis
Conclusion
...and 1 more sections

Figures (3)

Figure 1: The distillation process of the two student structures is shown above, where the left side demonstrates the distillation of deep and thin (D&T) model and the right side presents that of shallow and wide (S&W) model with two optional front-ends (waveform or Fbank).
Figure 2: The front-end distillation during the first $N$ steps.
Figure 3: CCA similarity of hidden states between D&T models before and after fine-tuning on the low-resource Libri-light fine-tuning datasets (1h & 10h). We choose 100 audio clips from the LibriSpeech test-clean dataset for analysis.
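The Figure 3 caption refers to CCA similarity between hidden states but does not say which variant is used. As a rough illustration only, the sketch below computes the mean canonical correlation between two sets of frame-level representations with plain NumPy. The function name `cca_similarity` is hypothetical, and the sketch assumes more frames than feature dimensions and full-rank inputs; the paper may use a different (for example weighted) CCA variant.

```python
import numpy as np


def cca_similarity(x, y):
    """Mean canonical correlation between two representation matrices.

    x: array of shape (frames, dim_x); y: array of shape (frames, dim_y).
    Assumes frames > max(dim_x, dim_y) and full column rank.
    """
    # Center each feature dimension.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # Orthonormal bases for the two column spaces.
    qx, _ = np.linalg.qr(x)
    qy, _ = np.linalg.qr(y)
    # Singular values of Qx^T Qy are the canonical correlations.
    corr = np.linalg.svd(qx.T @ qy, compute_uv=False)
    return float(np.clip(corr, 0.0, 1.0).mean())
```

Because canonical correlations are invariant to invertible linear maps of either input, two layers that encode the same information in different bases score close to 1, which is what makes this kind of measure convenient for comparing hidden states before and after fine-tuning.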
