AdaPTwin: Low-Cost Adaptive Compression of Product Twins in Transformers
Emil Biju, Anirudh Sriram, Mert Pilanci
TL;DR
This work addresses the challenge of running large transformer-based ASR models on resource-constrained devices while supporting speaker-specific personalization. It introduces AdaPTwin, a low-rank adaptive compression technique that jointly compresses product-dependent weight pairs in the transformer's attention and feed-forward layers, augmented with LoRA and layer-wise fine-tuning. The approach achieves up to 45% encoder compression with target-speaker WER increases under 2% and LibriSpeech generalization within 2.2%. It needs only about 8 hours of data and less than 20 minutes of fine-tuning per model when parallelized, which makes it more data-efficient than many distillation- and pruning-based methods. These results highlight AdaPTwin's potential for efficient, private on-device ASR at the edge, with flexible accuracy-size trade-offs and further compression available through optional quantization.
Abstract
While large transformer-based models have exhibited remarkable performance in speaker-independent speech recognition, their large size and computational requirements make them expensive or impractical to use in resource-constrained settings. In this work, we propose a low-rank adaptive compression technique called AdaPTwin that jointly compresses product-dependent pairs of weight matrices in the transformer attention layer. Our approach can prioritize the compressed model's performance on a specific speaker while maintaining generalizability to new speakers and acoustic conditions. Notably, our technique requires only 8 hours of speech data for fine-tuning, which can be accomplished in under 20 minutes, making it highly cost-effective compared to other compression methods. We demonstrate the efficacy of our approach by compressing the Whisper and Distil-Whisper models by up to 45% while incurring less than a 2% increase in word error rate.
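The idea of jointly compressing a product-dependent pair can be illustrated with a minimal NumPy sketch. It is not the authors' algorithm: it assumes a single attention head whose score depends on the weights only through the product `w_q @ w_k.T`, and it uses a plain truncated SVD of that product with no speaker-adaptive fine-tuning. The function name `compress_product_pair`, the dimensions, and the random weights are all hypothetical choices made for illustration.

```python
import numpy as np


def compress_product_pair(w_a, w_b, rank):
    """Factor the product w_a @ w_b.T into two thinner matrices.

    Returns (a_r, b_r), each with `rank` columns, such that a_r @ b_r.T is the
    best rank-`rank` approximation of w_a @ w_b.T in the Frobenius norm
    (Eckart-Young theorem).
    """
    product = w_a @ w_b.T
    u, s, vt = np.linalg.svd(product, full_matrices=False)
    root = np.sqrt(s[:rank])      # split each singular value evenly across both factors
    a_r = u[:, :rank] * root      # scales column i of U by sqrt(s_i)
    b_r = vt[:rank].T * root      # scales column i of V by sqrt(s_i)
    return a_r, b_r


# Hypothetical sizes for one attention head; real models differ.
rng = np.random.default_rng(0)
d_model, d_head, rank = 64, 16, 8
w_q = rng.standard_normal((d_model, d_head))
w_k = rng.standard_normal((d_model, d_head))

w_q_r, w_k_r = compress_product_pair(w_q, w_k, rank)

product = w_q @ w_k.T
approx = w_q_r @ w_k_r.T
rel_err = np.linalg.norm(product - approx) / np.linalg.norm(product)
params_before = w_q.size + w_k.size
params_after = w_q_r.size + w_k_r.size

print(f"parameters: {params_before} -> {params_after}")
print(f"relative Frobenius error of the product: {rel_err:.3f}")
```

Because the attention score depends only on the product, truncating the product directly minimizes the error that matters, rather than the reconstruction error of each matrix taken separately. Random Gaussian weights have a fairly flat singular spectrum, so the error printed here says nothing about the accuracy figures reported in the paper.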
