SKILL: Similarity-aware Knowledge distILLation for Speech Self-Supervised Learning

Luca Zampierin, Ghouthi Boukli Hacene, Bac Nguyen, Mirco Ravanelli

TL;DR

This paper introduces SKILL, a method that distills groups of teacher layers rather than individual, arbitrarily selected layers. The resulting distilled version of WavLM Base+ outperforms DPHuBERT and achieves state-of-the-art results in the 30M-parameter model class on several SUPERB tasks.

Abstract

Self-supervised learning (SSL) has achieved remarkable success across various speech-processing tasks. To improve its efficiency, previous works often rely on compression techniques. A notable recent attempt is DPHuBERT, which jointly applies knowledge distillation (KD) and structured pruning to learn a significantly smaller SSL model. In this paper, we contribute to this research domain by introducing SKILL, a novel method that distills groups of layers instead of individual, arbitrarily selected layers of the teacher network. The layers to distill are identified through a hierarchical clustering procedure applied to layer similarity measures. Extensive experiments demonstrate that our distilled version of WavLM Base+ not only outperforms DPHuBERT but also achieves state-of-the-art results in the 30M-parameter model class across several SUPERB tasks.
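The grouping step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the average-linkage merge rule, the function names `cluster_layers` and `cluster_average`, and the toy similarity matrix are all assumptions made for illustration. The sketch groups layers by repeatedly merging the two most similar clusters, then averages the representations within a cluster to form a single distillation target.

```python
from itertools import combinations


def cluster_layers(similarity, num_clusters):
    """Average-linkage agglomerative clustering over layer indices.

    similarity: square, symmetric list of lists (higher means more similar).
    Returns a sorted list of clusters, each a sorted list of layer indices.
    The linkage rule is an assumption; the source only says "hierarchical
    clustering applied to layer similarity measures".
    """
    clusters = [[i] for i in range(len(similarity))]

    def linkage(a, b):
        # Mean pairwise similarity between the members of two clusters.
        return sum(similarity[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > num_clusters:
        # Merge the pair of clusters with the highest average similarity.
        a, b = max(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(sorted(a + b))
    return sorted(clusters)


def cluster_average(layer_outputs, cluster):
    """Element-wise mean of the feature vectors of the layers in one cluster."""
    dim = len(layer_outputs[cluster[0]])
    return [
        sum(layer_outputs[i][k] for i in cluster) / len(cluster)
        for k in range(dim)
    ]


# Hypothetical similarity matrix for 4 layers: 0 and 1 are alike, 2 and 3 are alike.
toy_similarity = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
groups = cluster_layers(toy_similarity, num_clusters=2)

# Hypothetical 2-dimensional outputs of the 4 layers for one input frame.
toy_outputs = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [2.0, 2.0]]
targets = [cluster_average(toy_outputs, group) for group in groups]
```

With this toy input, layers 0 and 1 end up in one group and layers 2 and 3 in the other, and each group contributes one averaged target instead of one target per selected layer.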

Paper Structure

This paper contains 14 sections, 4 equations, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: Average cosine and $\ell_1$ distance between WavLM Base+ [chen2022wavlm] and its distilled version DPWavLM [peng23c_interspeech].
  • Figure 2: Comparison between DPHuBERT [peng23c_interspeech] and SKILL (ours) distillation strategies. (a) DPHuBERT distills only a set of pre-selected layers. (b) In SKILL, teacher layers are first clustered based on their similarity evaluated on a calibration dataset. The distillation is then performed on average representations of these clusters. (Dashed modules indicate the prunable parameters during stage 1 of training.)
  • Figure 3: Layer-wise CKA similarity for WavLM Base+ [chen2022wavlm] (a) and HuBERT Base [hsu2021hubert] (b). Darker colors indicate higher similarity.
  • Figure 4: Comparison of SKWavLM with DPWavLM [peng23c_interspeech] on PR, ASR, SID, and SD at different target sparsities.
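
The layer-wise CKA similarity shown in Figure 3 can be sketched as follows. This is a minimal illustration that assumes the linear variant of CKA (the listing above does not say which kernel the paper uses); the function names and the toy feature matrices are hypothetical. Each input is a list of per-frame feature vectors taken from one layer over the same frames.

```python
import math


def center(features):
    """Subtract the per-dimension mean over all frames (rows)."""
    n, d = len(features), len(features[0])
    means = [sum(row[k] for row in features) / n for k in range(d)]
    return [[row[k] - means[k] for k in range(d)] for row in features]


def cross_frobenius_sq(x, y):
    """Squared Frobenius norm of X^T Y for row-aligned matrices X and Y."""
    total = 0.0
    for a in range(len(x[0])):
        for b in range(len(y[0])):
            s = sum(x[i][a] * y[i][b] for i in range(len(x)))
            total += s * s
    return total


def linear_cka(x, y):
    """Linear CKA: ||X^T Y||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered features.

    Assumes neither input is constant across frames (otherwise the
    denominator is zero).
    """
    x, y = center(x), center(y)
    denom = math.sqrt(cross_frobenius_sq(x, x) * cross_frobenius_sq(y, y))
    return cross_frobenius_sq(x, y) / denom


# Hypothetical features: 4 frames, 2 dimensions each.
layer_a = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]]
# The same features rotated by 90 degrees and scaled by 3.
layer_b = [[-3.0 * v, 3.0 * u] for u, v in layer_a]
# An unrelated 1-dimensional feature.
layer_c = [[1.0], [-1.0], [1.0], [-1.0]]

same = linear_cka(layer_a, layer_a)
rotated = linear_cka(layer_a, layer_b)
unrelated = linear_cka(layer_a, layer_c)
```

Linear CKA is invariant to rotation and uniform scaling of the features, so `rotated` matches `same` up to floating-point error, while `unrelated` falls between 0 and 1. Evaluating this score for every pair of layers yields a similarity matrix of the kind a clustering step could consume.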