Convexity-based Pruning of Speech Representation Models

Teresa Dorszewski, Lenka Tětková, Lars Kai Hansen

Abstract

Speech representation models based on the transformer architecture and trained by self-supervised learning have shown great promise for solving tasks such as speech and speaker recognition, keyword spotting, emotion detection, and more. Typically, it is found that larger models lead to better performance. However, the significant computational effort involved in such large transformer systems is a challenge for embedded and real-world applications. Recent work has shown that there is significant redundancy in the transformer models for NLP and massive layer pruning is feasible (Sajjad et al., 2023). Here, we investigate layer pruning in audio models. We base the pruning decision on a convexity criterion. Convexity of classification regions has recently been proposed as an indicator of subsequent fine-tuning performance in a range of application domains, including NLP and audio. In empirical investigations, we find a massive reduction in the computational effort with no loss of performance or even improvements in certain cases.

Paper Structure (12 sections, 2 figures, 1 table)

Figures (2)

  • Figure 1: Convexity of latent representations of words, phonemes, and speakers, evaluated for pretrained and fine-tuned models (word classification and speaker identification), for base models (upper row) and large models (lower row). Models fine-tuned for word classification show increased convexity for word and phoneme representations and decreased convexity for speaker representations, while models fine-tuned for speaker identification show increased convexity for speaker representations and reduced convexity for word and phoneme representations.
  • Figure 2: Word classification accuracy of the pruned base models (number of layers denoted for each point) vs. the word convexity score of the corresponding layer in the pre-trained model. The best-performing pruned model for each model is marked with $\bigstar$; this is at layer 8 for all models except ccc-wav2vec2.

Theorems & Definitions (1)

  • Definition: Graph Convexity