Distillation and Pruning for Scalable Self-Supervised Representation-Based Speech Quality Assessment

Benjamin Stahl, Hannes Gamper

TL;DR

This work compresses self-supervised learning (SSL)-based non-intrusive speech quality assessment models by applying distillation and pruning to XLS-R-SQA. The authors train a teacher model on a diverse MOS dataset and distill its knowledge into compact convolutional transformer student models using unlabeled degraded speech; mixing in 20% labeled data was found to work best. They also explore data-driven pruning on the teacher's embeddings and find that pruning is more effective for larger models, while distillation is more effective for smaller ones. Together, distillation and pruning enable scalable non-intrusive MOS prediction across more than three orders of magnitude in model size with substantial performance retention.

Abstract

In this paper, we investigate distillation and pruning methods to reduce model size for non-intrusive speech quality assessment based on self-supervised representations. Our experiments build on XLS-R-SQA, a speech quality assessment model using wav2vec 2.0 XLS-R embeddings. We retrain this model on a large compilation of mean opinion score datasets, encompassing over 100,000 labeled clips. For distillation, using this model as a teacher, we generate pseudo-labels on unlabeled degraded speech signals and train student models of varying sizes. For pruning, we use a data-driven strategy. While data-driven pruning performs better at larger model sizes, distillation on unlabeled data is more effective for smaller model sizes. Distillation can halve the gap between the baseline's correlation with ground-truth MOS labels and that of the XLS-R-based teacher model, while reducing model size by two orders of magnitude compared to the teacher model.
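
The pseudo-labeling step described above can be illustrated with a minimal sketch. It assumes that the 20% labeled share mentioned in the TL;DR is applied per training batch, which this summary does not state. The names `build_mixed_batch` and `toy_teacher` are hypothetical and are not taken from the paper; the toy teacher stands in for the XLS-R-based model.

```python
import random


def build_mixed_batch(labeled, unlabeled, teacher, batch_size=10,
                      labeled_fraction=0.2, rng=None):
    """Return (clip, target) pairs mixing ground-truth MOS and pseudo-labels.

    labeled:   list of (clip, mos) pairs with human MOS labels
    unlabeled: list of clips without labels
    teacher:   callable mapping a clip to a predicted MOS (pseudo-label)
    Assumption: the labeled share is enforced per batch.
    """
    rng = rng or random.Random(0)
    n_labeled = round(batch_size * labeled_fraction)
    n_unlabeled = batch_size - n_labeled

    # Ground-truth targets for the labeled part of the batch.
    batch = [(clip, mos) for clip, mos in rng.sample(labeled, n_labeled)]
    # Teacher predictions serve as targets for the unlabeled part.
    batch += [(clip, teacher(clip)) for clip in rng.sample(unlabeled, n_unlabeled)]
    rng.shuffle(batch)
    return batch


def toy_teacher(clip):
    """Hypothetical stand-in for the teacher; always predicts a MOS of 3.0."""
    return 3.0


labeled_clips = [(f"labeled_{i}", 1.0) for i in range(50)]
unlabeled_clips = [f"unlabeled_{i}" for i in range(50)]
batch = build_mixed_batch(labeled_clips, unlabeled_clips, toy_teacher)
```

A student model would then be trained to regress the targets in each batch, whichever source they come from.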

Paper Structure

This paper contains 15 sections, 1 equation, 3 figures, 1 table.

Figures (3)

  • Figure 1: XLS-R-based speech quality assessment and its usage as a teacher model for distillation using unlabeled speech.
  • Figure 2: Resulting model sizes for student model variants.
  • Figure 3: Weighted average Pearson correlation coefficient as a function of model size with different approaches.
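
The metric named in Figure 3 can be sketched as follows. This summary does not say how the per-dataset correlations are weighted, so the sketch assumes weights proportional to the number of clips in each test set; the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def weighted_average_pcc(datasets):
    """Average per-dataset Pearson correlations, weighted by clip count.

    datasets: list of (predicted_mos, ground_truth_mos) pairs, one per test set.
    Assumption: each weight is the number of clips in that test set.
    """
    total = sum(len(pred) for pred, _ in datasets)
    return sum(len(pred) * pearson(pred, truth) for pred, truth in datasets) / total


# Two toy test sets: one perfectly correlated, one perfectly anti-correlated.
toy_sets = [
    ([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]),
    ([1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]),
]
score = weighted_average_pcc(toy_sets)
```

With these toy inputs the larger, anti-correlated set dominates, so the weighted average is negative.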