Attention-Guided Adaptation for Code-Switching Speech Recognition
Bobbi Aditya, Mahdin Rohmatillah, Liang-Hsuan Tai, Jen-Tzung Chien
TL;DR
The paper tackles code-switching ASR in multilingual settings by analyzing Whisper's decoder attention and identifying the heads related to language identification (LID). It introduces Attention-Guided Adaptation, which guides these LID-focused heads to attend to the correct language tokens via a ground-truth attention map, using a two-stage adapter training procedure that keeps the backbone fixed. On the SEAME Mandarin-English corpus, the method achieves an overall mixed error rate (MER) of $14.2\%$ while training only $5.6\%$ additional parameters relative to Whisper, outperforming the previous state-of-the-art method. The work demonstrates that leveraging attention patterns for language identity can yield a parameter-efficient and effective strategy for code-switching ASR, with the potential to generalize to other multilingual scenarios.
Abstract
The prevalence of powerful multilingual models such as Whisper has significantly advanced research on speech recognition. However, these models often struggle with the code-switching setting, which is essential in multilingual speech recognition. Recent studies have attempted to address this setting by separating the modules for different languages to ensure distinct latent representations for each language. Other methods model the switching mechanism based on language identification. In this study, a new attention-guided adaptation is proposed to conduct parameter-efficient learning for bilingual ASR. This method selects the attention heads in a model that most closely express language identities and then guides those heads to attend correctly to their corresponding languages. Experiments on a Mandarin-English code-switching speech corpus show that the proposed approach achieves a 14.2% mixed error rate, surpassing the state-of-the-art method, while training only 5.6% additional parameters over Whisper.
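For readers unfamiliar with the metric, the sketch below shows how a mixed error rate is commonly computed for Mandarin-English speech: Mandarin is scored per character, English per word, and the edit distance is normalized by the reference length. This is an illustrative assumption about the scoring convention, not the authors' evaluation script. The function names (`mixed_tokens`, `edit_distance`, `mixed_error_rate`) and the tokenization regex are hypothetical, and text normalization such as digit or punctuation handling is deliberately omitted.

```python
import re

# Assumption: a Mandarin "character" is any code point in the basic CJK
# Unified Ideographs block; English words are runs of letters/apostrophes.
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z']+")


def mixed_tokens(text):
    """Split text into Mandarin characters and lowercased English words."""
    return [tok.lower() for tok in _TOKEN.findall(text)]


def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference token
                curr[j - 1] + 1,         # insertion of a hypothesis token
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def mixed_error_rate(references, hypotheses):
    """Total edit errors divided by total reference tokens over a corpus."""
    errors = 0
    total = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens = mixed_tokens(ref)
        errors += edit_distance(ref_tokens, mixed_tokens(hyp))
        total += len(ref_tokens)
    if total == 0:
        raise ValueError("references contain no scorable tokens")
    return errors / total


if __name__ == "__main__":
    # One dropped English word ("new") against six reference tokens.
    mer = mixed_error_rate(["我想买 a new phone"], ["我想买 a phone"])
    print(f"MER = {mer:.1%}")
```

Scoring Mandarin by character and English by word avoids depending on a Chinese word segmenter, which is why a single mixed metric is typically reported for code-switching corpora instead of separate word and character error rates.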
