BaldWhisper: Faster Whisper with Head Shearing and Layer Merging

Yaya Sy, Christophe Cerisara, Irina Illina

TL;DR

This work tackles efficient on-device ASR with Whisper for data-scarce languages by introducing BaldWhisper, a two-stage pruning approach: it first merges adjacent decoder layers, then applies an activation-aware low-rank decomposition to the shared embeddings. The workflow is data-efficient, requiring only 32 hours of Bambara data, and trains the compressed model with a joint cross-entropy and knowledge-distillation loss. The resulting model is 48% smaller and 2.15x faster on a MacBook Air M1 while preserving 90% of the original Whisper's performance, without a large retraining corpus. The approach offers a practical path to deploying accurate, fast ASR for low-resource languages on edge devices, and it avoids vocabulary pruning, which is risky in code-switching contexts.

Abstract

Pruning large pre-trained transformers for low-resource languages is challenging, as it often requires massive retraining data to recover performance. For instance, Distill-Whisper prunes Whisper by 40% and retrains on 21,000 hours of speech, far beyond what is available for most languages. Can Whisper be made lighter and faster for edge devices in data-scarce settings? Focusing on Bambara with only 32h of speech-to-text data, we propose a new pruning recipe. Instead of vocabulary pruning, which is unsuitable due to frequent code-switching by Bambara speakers, we compress the embeddings with low-rank decomposition and feature distillation. Rather than removing layers, we merge them to limit performance loss. The final model preserves 90% of the original performance while being 48% smaller and 2.15x faster on a MacBook Air M1.
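
To make the embedding compression concrete, the following is a minimal sketch of the parameter arithmetic behind a low-rank factorization: a $V \times d$ embedding matrix is replaced by two factors of shapes $V \times r$ and $r \times d$. The vocabulary size, hidden size, and rank below are illustrative assumptions, not values taken from the paper, and the helper names are hypothetical. The sketch only counts parameters; it does not reproduce the activation-aware feature distillation described in the abstract.

```python
# Parameter count of a full embedding matrix versus a rank-r factorization.
# All sizes below are illustrative assumptions, not the paper's settings.


def embedding_params(vocab_size: int, d_model: int) -> int:
    """Parameters in a dense V x d embedding matrix."""
    return vocab_size * d_model


def low_rank_params(vocab_size: int, d_model: int, rank: int) -> int:
    """Parameters in the two factors (V x r) and (r x d)."""
    return vocab_size * rank + rank * d_model


VOCAB_SIZE = 51865  # assumed vocabulary size, for illustration only
D_MODEL = 768       # assumed hidden size, for illustration only
RANK = 128          # hypothetical rank

full = embedding_params(VOCAB_SIZE, D_MODEL)
factored = low_rank_params(VOCAB_SIZE, D_MODEL, RANK)
saving = 1 - factored / full

print(f"full: {full:,} parameters")
print(f"factored: {factored:,} parameters")
print(f"relative saving: {saving:.1%}")
```

The factorization saves parameters whenever $r < Vd / (V + d)$; because $V$ is much larger than $d$ here, almost any rank below $d$ yields a saving, and unlike vocabulary pruning every token keeps an embedding.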

Paper Structure

This paper contains 10 sections, 4 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Layer merging of the decoder layers of Whisper. Each layer of the student decoder is a merge ($\oplus$) of a pair of consecutive layers of the teacher. Then the student is trained on the Cross-Entropy and Knowledge Distillation joint loss.
  • Figure 2: Head shearing of Whisper embedding via low-rank decomposition. The decomposition is activation-aware because the low-rank weights are then feature-distilled.
  • Figure 3: Choice of values for $\alpha$ and $\beta$. The results of Bayesian hyperparameter search suggest that the only constraint is that $\alpha$ should be small.
  • Figure 4: Visualization of the activation similarities between all possible pairs of layers of the decoder. Consecutive pairs of layers tend to be more similar, suggesting that they can be merged.
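
The layer merging described in Figure 1 can be sketched as follows. The captions do not define the merge operator $\oplus$, so this example assumes the simplest choice, an element-wise average of the parameters of each consecutive pair; the paper's actual operator may differ. Layers are represented as plain dictionaries of flat parameter lists, and the names `merge_pair` and `merge_consecutive` are hypothetical.

```python
from typing import Dict, List

# A layer is modeled as a mapping from parameter name to a flat list of values.
Layer = Dict[str, List[float]]


def merge_pair(a: Layer, b: Layer) -> Layer:
    """Merge two layers by averaging parameters (an assumed form of the merge)."""
    if a.keys() != b.keys():
        raise ValueError("layers must have the same parameter names")
    merged: Layer = {}
    for name in a:
        if len(a[name]) != len(b[name]):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [(x + y) / 2 for x, y in zip(a[name], b[name])]
    return merged


def merge_consecutive(layers: List[Layer]) -> List[Layer]:
    """Build a student stack in which each layer merges one consecutive teacher pair."""
    if len(layers) % 2 != 0:
        raise ValueError("expected an even number of teacher layers")
    return [merge_pair(layers[i], layers[i + 1]) for i in range(0, len(layers), 2)]


# Toy teacher with four layers, each holding a single two-element parameter.
teacher = [{"w": [float(i), float(i) + 1.0]} for i in range(4)]
student = merge_consecutive(teacher)

print(len(teacher), "teacher layers ->", len(student), "student layers")
```

Merging halves the decoder depth while initializing every student layer from both of its parents, which fits the observation in Figure 4 that consecutive layers have similar activations. The merged student would then be trained with the joint cross-entropy and knowledge-distillation loss mentioned in Figure 1.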