Investigating Zero-Shot Generalizability on Mandarin-English Code-Switched ASR and Speech-to-text Translation of Recent Foundation Models with Self-Supervision and Weak Supervision
Chih-Kai Yang, Kuan-Po Huang, Ke-Han Lu, Chun-Yi Kuan, Chi-Yuan Hsiao, Hung-yi Lee
TL;DR
The paper studies how well large multilingual foundation models generalize zero-shot to Mandarin-English code-switching (CS) in automatic speech recognition (ASR) and speech-to-text translation (ST). It evaluates self-supervised models (MMS, SeamlessM4T, SeamlessM4T v2) and the weakly supervised Whisper-large-v3, together with Whisper variants such as concat-prompt, SICL, and Clairaudience, on three CS corpora using MER and BLEU. Self-supervised models can approach the performance of weakly supervised Whisper, with data scale and model design playing crucial roles, but intra-sentential CS remains challenging. The Whisper variants and in-context techniques provide notable gains, which suggests adapting such strategies to self-supervised models. Overall, the results point to the potential of zero-shot CS ASR and ST built on scalable pre-training and guided prompting, and inform future work on handling code-switching without extensive labeled data.
Abstract
This work evaluated several cutting-edge large-scale foundation models based on self-supervision or weak supervision, including SeamlessM4T, SeamlessM4T v2, and Whisper-large-v3, on three code-switched corpora. We found that self-supervised models can achieve performance close to that of the weakly supervised model, indicating the effectiveness of multilingual self-supervised pre-training. We also observed that these models still have room for improvement: they kept making similar mistakes and performed unsatisfactorily on intra-sentential code-switching. In addition, we explored the effectiveness of several variants of Whisper and concluded that they remain effective in a code-switching scenario, and that similar techniques for self-supervised models are worth studying to boost performance on code-switched tasks.
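For readers unfamiliar with the ASR metric named in the TL;DR, MER is usually read as mixed error rate: an edit-distance error rate in which Mandarin is scored per character and English per word. The following is a minimal sketch under that reading, not the authors' evaluation script. The helper names (`tokenize_mixed`, `edit_distance`, `mixed_error_rate`) and the text normalization (lowercasing, dropping punctuation, restricting Mandarin to the basic CJK block) are assumptions made for illustration.

```python
import re

# Assumption: one token per CJK character (basic block only) and one token
# per run of Latin letters, digits, or apostrophes. Punctuation is dropped.
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9']+")


def tokenize_mixed(text):
    """Split code-switched text into Mandarin characters and English words."""
    return [tok.lower() for tok in _TOKEN.findall(text)]


def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (unit costs)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def mixed_error_rate(references, hypotheses):
    """Corpus-level MER: total edit errors divided by total reference tokens."""
    errors = 0
    tokens = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens = tokenize_mixed(ref)
        errors += edit_distance(ref_tokens, tokenize_mixed(hyp))
        tokens += len(ref_tokens)
    if tokens == 0:
        raise ValueError("references contain no scorable tokens")
    return errors / tokens


if __name__ == "__main__":
    refs = ["我想要 book 一个 meeting room"]
    hyps = ["我想要 不可 一个 meeting room"]
    print(mixed_error_rate(refs, hyps))
```

In this made-up utterance the English word "book" is transcribed as two Mandarin characters, which costs one substitution and one insertion against eight reference tokens. An error of this kind, located at a switch point inside a sentence, is what intra-sentential code-switching evaluation is sensitive to.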
