AdaPTwin: Low-Cost Adaptive Compression of Product Twins in Transformers

Emil Biju, Anirudh Sriram, Mert Pilanci

TL;DR

This work addresses the challenge of running large transformer-based ASR models on resource-constrained devices while personalizing them to a specific speaker. It introduces AdaPTwin, a low-rank adaptive compression technique that jointly compresses product-dependent weight pairs in the transformer's attention layers and independently compresses the feed-forward weights, augmented with LoRA and layer-wise fine-tuning. The approach achieves up to 45% encoder compression with target-speaker WER increases under 2% and LibriSpeech generalization within 2.2%, using only about 8 hours of data and under 20 minutes of fine-tuning per model when parallelized, outperforming many distillation- and pruning-based methods in data efficiency. These results highlight AdaPTwin's potential for efficient, private, edge-ready on-device ASR with flexible trade-offs and further compression through optional quantization.

Abstract

While large transformer-based models have exhibited remarkable performance in speaker-independent speech recognition, their large size and computational requirements make them expensive or impractical to use in resource-constrained settings. In this work, we propose a low-rank adaptive compression technique called AdaPTwin that jointly compresses product-dependent pairs of weight matrices in the transformer attention layer. Our approach can prioritize the compressed model's performance on a specific speaker while maintaining generalizability to new speakers and acoustic conditions. Notably, our technique requires only 8 hours of speech data for fine-tuning, which can be accomplished in under 20 minutes, making it highly cost-effective compared to other compression methods. We demonstrate the efficacy of our approach by compressing the Whisper and Distil-Whisper models by up to 45% while incurring less than a 2% increase in word error rate.

Paper Structure

This paper contains 19 sections, 10 equations, 3 figures, 3 tables, 1 algorithm.

Figures (3)

  • Figure 1: The left image shows the computational flow within a standard transformer layer, while the right image depicts the layer with AdaPTwin replacements. The per-head parameters $(W^T_{Q_h}, W_{K_h})$ and $(W^T_{V_h}, W^T_{O_h})$ are product twins and are jointly compressed using the SVD of their products while $W_{FC_1}$ and $W_{FC_2}$ are compressed independently using the SVD of each matrix.
  • Figure 2: WER comparison of Whisper and Distil-Whisper models on LJSpeech with varying levels of encoder compression, while the decoder remains uncompressed. The level of compression is increased by compressing successive encoder layers (starting from the first layer) while maintaining consistent spectral and LoRA ranks. Larger models show greater encoder compressibility.
  • Figure 3: WER of Whisper base upon compressing all layers with and without quantization.
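The joint compression of product twins described in the Figure 1 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the `(d_head, d_model)` weight layout, and the even split of singular values between the two factors are assumptions made for illustration, and the LoRA and fine-tuning steps are omitted. It shows only the core idea: truncating the SVD of the product $W^T_{Q_h} W_{K_h}$ rather than truncating each matrix on its own.

```python
import numpy as np


def compress_product_twins(w_q, w_k, rank):
    """Factor the product w_q.T @ w_k into two rank-`rank` matrices.

    Assumed layout (hypothetical): w_q and w_k have shape (d_head, d_model),
    so the attention score for inputs x, y is x @ (w_q.T @ w_k) @ y.
    Returns a (d_model, rank) and b (rank, d_model) such that a @ b is the
    best rank-`rank` approximation of the product in Frobenius norm.
    """
    product = w_q.T @ w_k  # (d_model, d_model), rank at most d_head
    u, s, vt = np.linalg.svd(product, full_matrices=False)
    root = np.sqrt(s[:rank])  # split singular values between both factors
    a = u[:, :rank] * root
    b = root[:, None] * vt[:rank]
    return a, b


def compress_independently(w, rank):
    """Best rank-`rank` approximation of a single matrix via truncated SVD."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_head, d_model, rank = 16, 64, 8
    w_q = rng.standard_normal((d_head, d_model))
    w_k = rng.standard_normal((d_head, d_model))

    exact = w_q.T @ w_k
    a, b = compress_product_twins(w_q, w_k, rank)
    separate = compress_independently(w_q, rank).T @ compress_independently(w_k, rank)

    print("joint error:      ", np.linalg.norm(exact - a @ b))
    print("independent error:", np.linalg.norm(exact - separate))
    # Stored parameters: 2 * d_head * d_model before, 2 * rank * d_model after.
    print("params before/after:", 2 * d_head * d_model, 2 * rank * d_model)
```

Because the product of two independently truncated matrices also has rank at most `rank`, the Eckart-Young theorem guarantees that the joint factorization's Frobenius error on the product is never larger than the independent one at the same rank.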