Attention-Guided Adaptation for Code-Switching Speech Recognition

Bobbi Aditya; Mahdin Rohmatillah; Liang-Hsuan Tai; Jen-Tzung Chien

TL;DR

The paper tackles code-switching ASR in multilingual settings by analyzing Whisper's decoder attention and identifying the heads that track language identity (LID). It introduces Attention-Guided Adaptation, which guides only these LID-focused heads to attend to the correct language tokens via a ground-truth attention map, using a two-stage adapter training procedure that keeps the backbone fixed. On the SEAME Mandarin-English corpus, the method reaches an overall mixed error rate (MER) of $14.2\%$ while training only $5.6\%$ additional parameters on top of Whisper, outperforming the previous state-of-the-art method. The work shows that exploiting attention patterns for language identity can yield a parameter-efficient and effective strategy for code-switching ASR, with the potential to generalize to other multilingual scenarios.

Abstract

The prevalence of powerful multilingual models, such as Whisper, has significantly advanced research on speech recognition. However, these models often struggle with the code-switching setting, which is essential in multilingual speech recognition. Recent studies have attempted to address this setting by separating the modules for different languages to ensure distinct latent representations for each language. Other methods build the switching mechanism on language identification. In this study, a new attention-guided adaptation is proposed to conduct parameter-efficient learning for bilingual ASR. The method selects the attention heads in a model that most closely express language identities and then guides those heads to attend correctly to their corresponding languages. Experiments on a Mandarin-English code-switching speech corpus show that the proposed approach achieves a 14.2% mixed error rate, surpassing the state-of-the-art method, while training only 5.6% additional parameters over Whisper.
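
The idea of guiding selected heads toward language tokens can be illustrated with a small NumPy sketch. This is not the paper's formulation: the cross-entropy form of the loss, the function names `build_target_map` and `attention_guided_loss`, and the toy layout that places the `zh` and `en` language tokens at decoder positions 1 and 2 are all assumptions made for illustration.

```python
import numpy as np

def build_target_map(token_langs, lid_positions):
    """Build a hypothetical ground-truth attention map.

    token_langs: per-position language tag, or None for positions
                 that are not guided (e.g. special tokens).
    lid_positions: dict mapping a language tag to the index of its
                   language-ID token in the decoder sequence (assumed layout).
    """
    T = len(token_langs)
    target = np.zeros((T, T))
    for t, lang in enumerate(token_langs):
        if lang is not None:
            # Each word token should put all its attention on its LID token.
            target[t, lid_positions[lang]] = 1.0
    return target

def attention_guided_loss(attn, target, eps=1e-9):
    """Cross-entropy between the target map and the selected heads.

    attn: (H, T, T) attention weights of the selected heads.
    target: (T, T) map from build_target_map.
    Only rows that have a target contribute to the average.
    """
    guided = target.sum(axis=-1) > 0                          # (T,)
    ce = -(target[None] * np.log(attn + eps)).sum(axis=-1)    # (H, T)
    return ce[:, guided].mean()

# Toy sequence: three special tokens, then zh, zh, en, zh word tokens.
token_langs = [None, None, None, "zh", "zh", "en", "zh"]
lid_positions = {"zh": 1, "en": 2}   # assumed positions of the LID tokens
target = build_target_map(token_langs, lid_positions)

T = len(token_langs)
uniform = np.full((2, T, T), 1.0 / T)          # two heads, uniform attention
perfect = np.repeat(target[None], 2, axis=0)   # two heads matching the target

loss_uniform = attention_guided_loss(uniform, target)
loss_perfect = attention_guided_loss(perfect, target)
```

In this toy setup, heads that already match the target map incur a loss close to zero, while heads with uniform attention incur a loss close to $\ln 7 \approx 1.95$. In a setting like the one described above, a term of this kind would be added to the recognition loss and would update only the adapter parameters, with the backbone kept fixed.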

Paper Structure (11 sections, 7 equations, 3 figures, 2 tables, 1 algorithm)

Introduction
Code-Switching in Whisper
Attention-Guided Adaptation
Attention Head Selection
Attention-Guided Adaptation
Two-Stage Optimization Procedure
Experiments
Experimental Settings
Experimental Results
Experimental Analysis
Conclusions

Figures (3)

Figure 1: (left) Adapters in encoder and decoder of a transformer. (right) Attention-guided loss calculated from decoder.
Figure 2: Examples of three different attention patterns captured using the Whisper model. (left) Self & neighboring attention obtained from head 5 in layer 4. (middle) Special token attention obtained from head 8 in layer 2. (right) LID token attention obtained from head 11 in layer 8.
Figure 3: LID token attention from head 8 in layer 2 using three different models: (left) backbone w/o adapter, (middle) one-stage adapter, and (right) two-stage adapter w/ AG. This comparison shows that the attention-guided model correctly attends to the LID of each word token. $\langle$blnk$\rangle$ stands for the blank token.
