Windowed SummaryMixing: An Efficient Fine-Tuning of Self-Supervised Learning Models for Low-resource Speech Recognition
Aditya Srinivas Menon, Kumud Tripathi, Raj Gohil, Pankaj Wasnik
TL;DR
This work addresses the high computational cost of self-attention in self-supervised speech models by introducing Windowed SummaryMixing (WSM), which adds a local neighborhood context to the global summary while maintaining linear-time complexity. A selective fine-tuning strategy further reduces resource demands by replacing only the last few self-attention layers with WSM blocks and tuning just these layers in low-resource settings. Empirical results across monolingual and multilingual SSL models show WSM achieving lower WER/CER with substantial VRAM savings and faster inference, demonstrating strong potential for efficient, real-time ASR in low-resource scenarios. The approach is shown to be scalable across languages and datasets, offering a practical path toward efficient SSL fine-tuning for speech recognition tasks.
Abstract
Self-supervised learning (SSL) has advanced speech processing but suffers from quadratic complexity due to self-attention. To address this, SummaryMixing (SM) has been proposed as a linear-time alternative that summarizes entire utterances using mean pooling but lacks sufficient local context. In this work, we introduce Windowed SummaryMixing (WSM), which enhances SM by integrating local neighborhood summaries alongside the global summary, maintaining efficiency while improving the modeling of temporal dependencies. Additionally, we introduce a selective fine-tuning approach, replacing self-attention layers in SSL models with WSM blocks and fine-tuning only these blocks in low-resource settings. Our approach improves ASR performance while reducing peak VRAM usage by 40% in the SSL models. WSM blocks have linear-time complexity with enhanced context awareness. Selectively replacing some attention layers reduces compute, memory, and latency, making the approach well suited to low-resource speech recognition.
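The combination of a global summary with local neighborhood summaries can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`windowed_mean`, `windowed_summary_mixing`), the weight matrices (`w_f`, `w_s`, `w_c`), the ReLU projections, the symmetric window of radius `window`, and the choice to concatenate the three feature groups are all assumptions made for illustration. The sketch only shows why the cost stays linear in the number of frames: the global summary is one mean over time, and the windowed means are read off a prefix sum.

```python
import numpy as np


def windowed_mean(s, window):
    """Mean of s over frames [t - window, t + window], clipped at the edges.

    Uses prefix sums, so the cost is O(T) regardless of the window size.
    s: (T, H) array. Returns a (T, H) array.
    """
    T, H = s.shape
    # prefix[k] holds the sum of s[0:k]; prefix[0] is all zeros.
    prefix = np.vstack([np.zeros((1, H)), np.cumsum(s, axis=0)])
    idx = np.arange(T)
    lo = np.maximum(idx - window, 0)
    hi = np.minimum(idx + window + 1, T)
    return (prefix[hi] - prefix[lo]) / (hi - lo)[:, None]


def windowed_summary_mixing(x, w_f, w_s, w_c, window=2):
    """Hypothetical WSM-style block (assumed structure, for illustration only).

    x: (T, D) frame features. Returns a (T, D_out) array.
    """
    f = np.maximum(x @ w_f, 0.0)  # per-frame local transform, (T, H)
    s = np.maximum(x @ w_s, 0.0)  # per-frame summary features, (T, H)

    global_summary = s.mean(axis=0, keepdims=True)  # (1, H), as in mean-pooled SM
    local_summary = windowed_mean(s, window)        # (T, H), neighborhood context

    # Assumption: the three feature groups are concatenated and mixed linearly.
    mixed = np.concatenate(
        [f, local_summary, np.broadcast_to(global_summary, f.shape)], axis=1
    )  # (T, 3H)
    return mixed @ w_c


# Toy usage with random weights (all sizes are arbitrary).
rng = np.random.default_rng(0)
T, D, H = 50, 16, 8
x = rng.standard_normal((T, D))
w_f = rng.standard_normal((D, H))
w_s = rng.standard_normal((D, H))
w_c = rng.standard_normal((3 * H, D))
y = windowed_summary_mixing(x, w_f, w_s, w_c, window=2)
print(y.shape)  # (50, 16)
```

In this sketch, a window that covers the whole utterance makes the local summary equal to the global one at every frame, so the window radius controls how much additional local context each frame receives.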
