Adapting Whisper for Code-Switching through Encoding Refining and Language-Aware Decoding
Jiahui Zhao, Hao Shi, Chenrui Cui, Tianrui Wang, Hexin Liu, Zhaoheng Ni, Lingxuan Ye, Longbiao Wang
TL;DR
This work adapts the Whisper model to code-switching (CS) ASR through two components: an encoder refiner, an LSTM-based module guided by CTC that improves the encoder's ability to capture intra-sentence language switches, and a language-aware decoding scheme that pairs two language prompts with adapters in each decoder layer and combines their outputs through a fusion module. The two components yield complementary gains: on the SEAME dataset, the approach achieves relative MER reductions over the baseline of 4.1% on dev_man and 7.2% on dev_sge, and it outperforms prior state-of-the-art CS-ASR methods. The results show that a large multilingual model can be adapted to CS tasks in a parameter-efficient way, with the largest improvements on the non-native-language portions of CS speech.
Abstract
Code-switching (CS) automatic speech recognition (ASR) is challenging because accents, auditory similarity, and seamless language switches cause language confusion. Adapting pre-trained multilingual models has shown promising performance for CS-ASR. In this paper, we adapt Whisper, a large-scale multilingual pre-trained speech recognition model, to CS on both the encoder and the decoder side. First, we propose an encoder refiner to enhance the encoder's capacity to model intra-sentence switching. Second, we propose two sets of language-aware adapters, each paired with a different language prompt embedding, to obtain language-specific decoding information in each decoder layer. A fusion module then fuses the two language-aware decoding outputs. Experimental results on the SEAME dataset show that, compared with the baseline model, the proposed approach achieves relative MER reductions of 4.1% and 7.2% on the dev_man and dev_sge test sets, respectively, surpassing state-of-the-art methods. Our experiments also show that the proposed method significantly improves performance on the non-native language in CS speech, indicating that the approach enables Whisper to better distinguish between the two languages.
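
To make the reported metric concrete, the sketch below shows one common way to compute a mixed error rate (MER) for Mandarin-English speech and the relative reduction between two systems. It assumes the usual convention that Mandarin is scored per character and English per word; the abstract does not describe the scoring script, so the tokenization rule, the function names (`mixed_tokens`, `mer`, `relative_reduction`), and the example sentence are illustrative assumptions rather than the authors' evaluation code.

```python
import re

# Assumption: one token per Mandarin character (CJK Unified Ideographs),
# one token per run of Latin letters, digits, or apostrophes.
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9']+")


def mixed_tokens(text):
    """Split a code-switched sentence into Mandarin characters and English words."""
    return _TOKEN.findall(text.lower())


def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (substitution, insertion, deletion)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def mer(references, hypotheses):
    """Mixed error rate in percent over paired reference/hypothesis strings."""
    errors = 0
    total = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens = mixed_tokens(ref)
        errors += edit_distance(ref_tokens, mixed_tokens(hyp))
        total += len(ref_tokens)
    if total == 0:
        raise ValueError("references contain no scorable tokens")
    return 100.0 * errors / total


def relative_reduction(baseline, proposed):
    """Relative error-rate reduction in percent: 100 * (baseline - proposed) / baseline."""
    return 100.0 * (baseline - proposed) / baseline
```

For instance, a made-up reference `"我想去 shopping mall"` yields five tokens (three characters and two words), so a hypothesis with one wrong English word scores an MER of 20%. Relative reduction is measured against the baseline's own error rate, so a drop from 20% to 18% MER is a 10% relative reduction, not a 2% one.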
