Selective Attention Merging for low resource tasks: A case study of Child ASR
Natarajan Balaji Shankar, Zilai Wang, Eray Eren, Abeer Alwan
TL;DR
The paper tackles the challenge of applying Speech Foundation Models (SFMs) to low-resource child automatic speech recognition (ASR) through model merging. It introduces Selective Attention Merge (SA Merge), which combines task vectors from the attention layers only, using layer-dependent weighting. SA Merge emphasizes lower-layer acoustic representations while incorporating higher-layer linguistic knowledge from larger adult-speech models, yielding consistent improvements over standard merging methods. The authors report relative word error rate (WER) reductions of up to 14% and, when SA Merge is combined with data augmentation, a state-of-the-art WER of 8.69 on the MyST dataset for Whisper-small. They further show that task vectors learned from augmentations can transfer across datasets and that task vectors from different augmentations align differently, which suggests the augmentation strategies are complementary for robust low-resource child ASR. Overall, SA Merge shows strong potential for improving child ASR in resource-constrained settings and invites future exploration in other low-resource domains.
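The merging idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the linear weighting schedule, the `lam_low`/`lam_high` values, the parameter-name pattern, and the choice to copy non-attention parameters from the child-speech model are all assumptions made for illustration. A task vector here is simply the difference between fine-tuned and base weights.

```python
import re
import numpy as np

# Hypothetical naming scheme for attention projection weights, e.g.
# "layers.3.attn.q.weight". Real checkpoints will use different names.
ATTN_PATTERN = re.compile(r"layers\.(\d+)\.attn\.(q|k|v)\.weight$")


def layer_weight(layer, num_layers, lam_low=0.9, lam_high=0.5):
    """Assumed linear schedule: the weight on the child-speech task vector
    is highest in the lowest layer and decays with depth."""
    if num_layers == 1:
        return lam_low
    frac = layer / (num_layers - 1)
    return lam_low + frac * (lam_high - lam_low)


def selective_attention_merge(base, child, other, num_layers):
    """Merge two fine-tuned models that share the same base model.

    base, child, other: dicts mapping parameter names to NumPy arrays.
    Only attention matrices are merged; every other parameter is copied
    from the child-speech model (an assumption of this sketch).
    """
    merged = {}
    for name, w_base in base.items():
        match = ATTN_PATTERN.search(name)
        if match is None:
            merged[name] = child[name].copy()
            continue
        lam = layer_weight(int(match.group(1)), num_layers)
        tau_child = child[name] - w_base   # task vector of the child model
        tau_other = other[name] - w_base   # task vector of the other model
        merged[name] = w_base + lam * tau_child + (1.0 - lam) * tau_other
    return merged


if __name__ == "__main__":
    names = ["layers.0.attn.q.weight", "layers.1.attn.q.weight",
             "layers.0.mlp.weight"]
    base = {n: np.zeros((2, 2)) for n in names}
    child = {n: np.ones((2, 2)) for n in names}
    other = {n: np.full((2, 2), 3.0) for n in names}
    merged = selective_attention_merge(base, child, other, num_layers=2)
    for n in names:
        print(n, merged[n][0, 0])
```

In the toy run, the lowest layer stays closer to the child-speech weights and the top layer takes an even mix, which mirrors the stated emphasis on lower-layer acoustic representations.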
Abstract
While Speech Foundation Models (SFMs) excel in various speech tasks, their performance on low-resource tasks such as child Automatic Speech Recognition (ASR) is hampered by limited pretraining data. To address this, we explore different model merging techniques to leverage knowledge from models trained on larger, more diverse speech corpora. This paper also introduces Selective Attention (SA) Merge, a novel method that selectively merges task vectors from attention matrices to enhance SFM performance on low-resource tasks. Experiments on the MyST database show relative word error rate (WER) reductions of up to 14%, outperforming existing model merging and data augmentation techniques. By combining data augmentation techniques with SA Merge, we achieve a new state-of-the-art WER of 8.69 on the MyST database for the Whisper-small model, highlighting the potential of SA Merge for improving low-resource ASR.
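Because the headline result is a relative rather than an absolute WER reduction, a short worked example may help. The helper name and the input values below are hypothetical and are not results from the paper; they only show how a relative reduction is computed from a baseline WER and an improved WER.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Return the relative WER reduction, in percent, of new_wer over baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Illustrative values only: a drop from 10.0 to 8.6 WER is a 1.4-point
    # absolute reduction but a 14% relative reduction.
    print(round(relative_wer_reduction(10.0, 8.6), 2))  # 14.0
```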
