SKILL: Similarity-aware Knowledge distILLation for Speech Self-Supervised Learning

Luca Zampierin; Ghouthi Boukli Hacene; Bac Nguyen; Mirco Ravanelli

TL;DR

This paper introduces SKILL, a method that distills groups of teacher layers rather than individual, arbitrarily selected ones. The resulting distilled version of WavLM Base+ outperforms DPHuBERT and achieves state-of-the-art results in the 30M-parameter model class on several SUPERB tasks.

Abstract

Self-supervised learning (SSL) has achieved remarkable success across various speech-processing tasks. To improve its efficiency, previous works often rely on compression techniques. A notable recent attempt is DPHuBERT, which applies joint knowledge distillation (KD) and structured pruning to learn a significantly smaller SSL model. In this paper, we contribute to this research domain by introducing SKILL, a novel method that conducts distillation across groups of layers instead of distilling individual, arbitrarily selected layers within the teacher network. The layers to distill are identified through a hierarchical clustering procedure applied to layer similarity measures. Extensive experiments demonstrate that our distilled version of WavLM Base+ not only outperforms DPHuBERT but also achieves state-of-the-art results in the 30M-parameter model class across several SUPERB tasks.
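The grouping idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes linear CKA as the layer similarity measure (the Figure 3 caption mentions CKA but not the variant), average-linkage agglomerative clustering on the distance 1 - CKA, and synthetic random features standing in for teacher layer outputs on a calibration set. The function names `linear_cka`, `similarity_matrix`, `cluster_layers`, and `cluster_targets` are hypothetical.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two feature matrices of shape (frames, dim)."""
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    return cross / (np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro"))


def similarity_matrix(layers):
    """Pairwise CKA between all layer outputs."""
    n = len(layers)
    sim = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            sim[i, j] = sim[j, i] = linear_cka(layers[i], layers[j])
    return sim


def cluster_layers(sim, num_clusters):
    """Average-linkage agglomerative clustering (an assumed linkage choice)."""
    dist = 1.0 - sim
    clusters = [[i] for i in range(sim.shape[0])]
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])  # merge the closest pair
        del clusters[b]
    return sorted(clusters)


def cluster_targets(layers, clusters):
    """Distillation targets: the average representation of each cluster."""
    return [np.mean([layers[i] for i in c], axis=0) for c in clusters]


# Synthetic stand-in for teacher layer outputs: six "layers" built from
# three independent bases plus small noise, so the groups are known.
rng = np.random.default_rng(0)
bases = [rng.standard_normal((200, 8)) for _ in range(3)]
group_of_layer = [0, 0, 1, 1, 1, 2]
layers = [bases[g] + 0.01 * rng.standard_normal((200, 8)) for g in group_of_layer]

sim = similarity_matrix(layers)
clusters = cluster_layers(sim, num_clusters=3)
targets = cluster_targets(layers, clusters)
```

On this toy input, layers built from the same base end up in the same cluster, and each entry of `targets` is the mean of one cluster's layer outputs. In a real setting, `layers` would hold the teacher's hidden states on calibration data, and the number of clusters would be a design choice.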

Paper Structure (14 sections, 4 equations, 4 figures, 3 tables)

Introduction
Background
Knowledge distillation
Structured pruning
Proposed Method
Similarity-based clustering
Similarity-aware knowledge distillation
Experimental setup
Results
Limitations of DPHuBERT
SUPERB performance
Sparsity analysis
Ablation study
Conclusion

Figures (4)

Figure 1: Average cosine and $\ell_1$ distance between WavLM Base+ (Chen et al., 2022) and its distilled version DPWavLM (Peng et al., 2023).
Figure 2: Comparison between DPHuBERT (Peng et al., 2023) and SKILL (ours) distillation strategies. (a) DPHuBERT distills only a set of pre-selected layers. (b) In SKILL, teacher layers are first clustered based on their similarity evaluated on a calibration dataset. The distillation is then performed on average representations of these clusters. (Dashed modules indicate the prunable parameters during stage 1 of training.)
Figure 3: Layer-wise CKA similarity for WavLM Base+ (Chen et al., 2022) (a) and HuBERT Base (Hsu et al., 2021) (b). Darker colors indicate higher similarity.
Figure 4: Comparison of SKWavLM with DPWavLM (Peng et al., 2023) on PR, ASR, SID, and SD at different target sparsities.
