Improving Zero-Shot Cross-Lingual Transfer via Progressive Code-Switching
Zhuoran Li, Chunming Hu, Junfan Chen, Zhijun Chen, Xiaohui Guo, Richong Zhang
TL;DR
This work tackles zero-shot cross-lingual transfer and the risk that uncontrolled code-switching can degrade multilingual alignment. It introduces Progressive Code-Switching (PCS), a curriculum-based framework that uses an LRP-driven word relevance score as a difficulty measure, a temperature-controlled code-switcher, and a dynamic scheduler to gradually incorporate harder code-switched data while mitigating catastrophic forgetting. PCS is evaluated on three cross-lingual tasks (PAWS-X, MLDoc, XTOD) across ten languages with backbones like mBERT and XLM-R, achieving state-of-the-art results and demonstrating robust improvements over strong code-switching baselines. The approach enhances cross-lingual representation alignment and offers a practical, scalable way to leverage code-switching data for zero-shot transfer across diverse languages and tasks.
Abstract
Code-switching is a data augmentation scheme that mixes words from multiple languages into source-language text. It has achieved considerable generalization performance on cross-lingual transfer tasks by aligning cross-lingual contextual word representations. However, uncontrolled and excessive word replacement introduces noisy samples into model training. In other words, excessively code-switched text samples hurt the model's cross-lingual transferability. To this end, we propose a Progressive Code-Switching (PCS) method that gradually generates moderately difficult code-switching examples for the model to discriminate, from easy to hard. The idea is to progressively incorporate the multilingual knowledge previously learned from easier code-switching data to guide model optimization on the harder code-switching data that follows. Specifically, we first design a difficulty measurer that measures the impact of replacing each word in a sentence based on its word relevance score. Then a code-switcher generates code-switching data of increasing difficulty via a controllable temperature variable. In addition, a training scheduler decides when to sample harder code-switching data for model training. Experiments show that our model achieves state-of-the-art results on three different zero-shot cross-lingual transfer tasks across ten languages.
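To make the three components concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes that per-word relevance scores are already available (the abstract does not say how they are computed), that the temperature acts as a replacement ratio in [0, 1], that less relevant words are replaced first, and that the scheduler is a simple linear ramp. The names code_switch and temperature_at, the toy bilingual dictionary, and the linear schedule are all hypothetical choices made for illustration.

```python
import math


def code_switch(tokens, relevance, dictionary, tau):
    """Replace roughly a fraction `tau` of the tokens with dictionary translations.

    Assumption: tokens with the lowest relevance score are replaced first,
    so a small `tau` yields "easy" examples and a large `tau` "hard" ones.
    """
    if len(tokens) != len(relevance):
        raise ValueError("tokens and relevance must have the same length")
    tau = min(max(tau, 0.0), 1.0)            # clamp temperature to [0, 1]
    budget = math.floor(tau * len(tokens))   # number of words allowed to change
    order = sorted(range(len(tokens)), key=lambda i: relevance[i])
    switched = list(tokens)
    replaced = 0
    for i in order:
        if replaced >= budget:
            break
        translation = dictionary.get(tokens[i].lower())
        if translation is not None:          # skip words with no dictionary entry
            switched[i] = translation
            replaced += 1
    return switched


def temperature_at(step, total_steps, tau_max=1.0):
    """Hypothetical linear scheduler: ramp the temperature from 0 to `tau_max`."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    return tau_max * progress


# Toy data: the relevance scores and the English-German dictionary are made up.
tokens = ["the", "movie", "was", "great"]
relevance = [0.05, 0.4, 0.1, 0.9]
en_de = {"the": "der", "movie": "Film", "was": "war", "great": "großartig"}

easy = code_switch(tokens, relevance, en_de, tau=temperature_at(5, 10))
hard = code_switch(tokens, relevance, en_de, tau=temperature_at(10, 10))
```

With these toy inputs, the midpoint of the schedule gives a temperature of 0.5, so only the two least relevant words ("the" and "was") are replaced, while the final step replaces every word that has a dictionary entry. A scheduler driven by validation performance rather than step count would fit the same interface.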
