How Redundant Is the Transformer Stack in Speech Representation Models?

Teresa Dorszewski, Albert Kjøller Jacobsen, Lenka Tětková, Lars Kai Hansen

TL;DR

The paper analyzes redundancy in the transformer stacks of speech representation models by measuring layer similarity with three metrics and by pruning layers without retraining. It identifies a robust two-block processing structure and shows that the stack is highly prunable: up to ~45% of layers can be removed while preserving most of the performance. It also uses knowledge distillation to replace the entire transformer stack with mimicking networks. These networks cut the number of parameters by $95\%$–$98\%$ and inference time by $\approx94\%$ while maintaining around $95\%$ or more of the original accuracy, indicating that the full stack is nearly redundant for downstream tasks. These findings support deploying much smaller, faster speech representation models on device, enabling practical automatic speech recognition under resource constraints.

Abstract

Self-supervised speech representation models, particularly those leveraging transformer architectures, have demonstrated remarkable performance across various tasks such as speech recognition, speaker identification, and emotion detection. Recent studies on transformer models revealed a high redundancy between layers and the potential for significant pruning, which we will investigate here for transformer-based speech representation models. We perform a detailed analysis of layer similarity in speech representation models using three similarity metrics: cosine similarity, centered kernel alignment, and mutual nearest-neighbor alignment. Our findings reveal a block-like structure of high similarity, suggesting two main processing steps and significant redundancy of layers. We demonstrate the effectiveness of pruning transformer-based speech representation models without the need for post-training, achieving up to 40% reduction in transformer layers while maintaining over 95% of the model's predictive capacity. Furthermore, we employ a knowledge distillation method to substitute the entire transformer stack with mimicking layers, reducing the network size by 95-98% and the inference time by up to 94%. This substantial decrease in computational load occurs without considerable performance loss, suggesting that the transformer stack is almost completely redundant for downstream applications of speech representation models.
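To make the three similarity metrics named in the abstract concrete, here is a minimal NumPy sketch of how a layer-by-layer similarity matrix could be computed. It is an illustration under stated assumptions, not the authors' implementation: the linear variant of CKA, the overlap-based definition of mutual nearest-neighbor alignment, the choice of k, and the random toy "layer activations" are all assumptions, and every function and variable name is hypothetical.

```python
import numpy as np


def cosine_similarity(x, y):
    """Mean cosine similarity between matching rows of x and y (same shape)."""
    num = np.sum(x * y, axis=1)
    den = np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1)
    return float(np.mean(num / den))


def linear_cka(x, y):
    """Linear centered kernel alignment between two (n_samples, dim) matrices.

    Assumption: the linear kernel is used; other kernels are possible.
    """
    x = x - x.mean(axis=0, keepdims=True)  # center each feature
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def mutual_knn(x, y, k=5):
    """Average overlap between each sample's k nearest neighbours in x and in y.

    Assumption: Euclidean distance and a simple set-overlap score.
    """
    def knn_indices(z):
        dist = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
        np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbour
        return np.argsort(dist, axis=1)[:, :k]

    nx, ny = knn_indices(x), knn_indices(y)
    overlap = [len(set(a) & set(b)) / k for a, b in zip(nx, ny)]
    return float(np.mean(overlap))


def similarity_matrix(layers, metric):
    """Pairwise similarity between all layers under the given metric."""
    n = len(layers)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            sim[i, j] = metric(layers[i], layers[j])
    return sim


# Toy stand-in for per-layer embeddings: 4 "layers", 50 samples, 16 dimensions.
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 16))
layers = [base + 0.1 * i * rng.normal(size=base.shape) for i in range(4)]
cka = similarity_matrix(layers, linear_cka)
```

With real models, each entry of `layers` would hold the embeddings produced by one transformer layer for the same set of inputs. A block-like structure would then appear as square regions of high similarity along the diagonal of the resulting matrix.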

Paper Structure

This paper contains 12 sections, 3 equations, 6 figures, and 4 tables.

Figures (6)

  • Figure 1: Conceptual overview of mimicking networks. Left: The general transformer model, consisting of a feature extractor module, the transformer stack, and a classification/probing layer. We denote the first part of the network until layer $i$ by $f^{\left( i\right)}$, the remaining part of the network from layer $i$ to $\mathcal{L}$ by $g^{\left( i \right)}$, and the classification layer by $h$. Middle: A 2-layer mimicking network where $m_f^{\left(i\right)}$ and $m_g^{\left(i\right)}$ represent mimicked representations of $f^{\left(i\right)}$ and $g^{\left(i\right)}$, respectively, learned via the step-1 loss. In step 2, the classifier, $\overline{h}$, and the mimicking layers are then finetuned for the downstream task. Right: Design of the mimicking layer, which maps an input embedding of dimension $d$ to a $z$-dimensional representation before mapping it back to the original shape.
  • Figure 2: Analysis of redundancy of layers using similarity measures and pruning of layers: $(a)$ Similarity between layers of wav2vec2 and wavLM. All three metrics (cosine similarity, CKA, mutual kNN) reveal a block structure. $(b)$ Effect of pruning on performance, using four different pruning objectives. Up to 45% of layers can be pruned while maintaining 95% of accuracy. After pruning most layers, the model performance drops to random chance. Uncertainty ranges cover the empirical 2.5 and 97.5 quantiles obtained from $N=5$ runs. $(c)$ Visualisation of pruned layers on top of the kNN similarity matrix (with $50\%$ performance threshold). Light layers indicate pruned layers, dark layers are still present. Backward and forward pruning (left+middle) only preserve performance as long as both blocks are still present. Pruning based on kNN-BI (right) prunes layers mainly in the first block.
  • Figure 3: Simplification of transformer stack using mimicking networks. Reduction in inference time (up to 87%) and number of parameters (up to 95%) using mimicking networks, while maintaining 95% of the original accuracy. We test transformer and linear layers with different dimensions $z$. Inference time is normalized to 1 (inference time of the original model); the pruned model has 3 layers pruned using kNN-BI. These results are for wav2vec2; results for other models are in the appendix.
  • Figure 4: wav2vec2-large
  • Figure 5: wavLM-small
  • ...and 1 more figures
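
The mimicking layer described in the Figure 1 caption, which maps a $d$-dimensional embedding to $z$ dimensions and back, can be sketched as follows. This is a hedged illustration of the linear variant only: the class name, the initialization, the absence of a nonlinearity, the mean-squared-error stand-in for the step-1 loss, and the values of `d` and `z` are all assumptions rather than details confirmed by the text above.

```python
import numpy as np


class LinearMimickingLayer:
    """Hypothetical linear bottleneck: d -> z -> d."""

    def __init__(self, d, z, seed=0):
        rng = np.random.default_rng(seed)
        self.w_down = rng.normal(scale=d ** -0.5, size=(d, z))
        self.b_down = np.zeros(z)
        self.w_up = rng.normal(scale=z ** -0.5, size=(z, d))
        self.b_up = np.zeros(d)

    def __call__(self, x):
        hidden = x @ self.w_down + self.b_down  # (..., d) -> (..., z)
        return hidden @ self.w_up + self.b_up   # (..., z) -> (..., d)

    def num_parameters(self):
        params = (self.w_down, self.b_down, self.w_up, self.b_up)
        return sum(p.size for p in params)


def mimic_loss(mimicked, target):
    """Mean squared error, used here only as an assumed stand-in for the step-1 loss."""
    return float(np.mean((mimicked - target) ** 2))


# Assumed sizes for illustration: embedding dimension d and bottleneck z.
d, z = 768, 64
m_f = LinearMimickingLayer(d, z, seed=0)  # would mimic f^(i)
m_g = LinearMimickingLayer(d, z, seed=1)  # would mimic g^(i)

x = np.random.default_rng(2).normal(size=(10, d))  # 10 toy frames
out = m_g(m_f(x))                                   # 2-layer mimicking network
total_params = m_f.num_parameters() + m_g.num_parameters()
```

Each such layer holds $2dz + d + z$ parameters, so its size is controlled by the bottleneck dimension $z$. In a two-step setup like the one in the caption, the layers would first be fitted to match the original representations and then finetuned together with the classifier.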