Improving Zero-Shot Cross-Lingual Transfer via Progressive Code-Switching

Zhuoran Li; Chunming Hu; Junfan Chen; Zhijun Chen; Xiaohui Guo; Richong Zhang

TL;DR

This work tackles zero-shot cross-lingual transfer and the risk that uncontrolled code-switching degrades multilingual alignment. It introduces Progressive Code-Switching (PCS), a curriculum-based framework with three parts: an LRP-driven word relevance score that serves as the difficulty measure, a temperature-controlled code-switcher, and a dynamic scheduler that gradually introduces harder code-switched data while mitigating catastrophic forgetting. PCS is evaluated on three cross-lingual tasks (PAWS-X, MLDoc, XTOD) across ten languages with backbones such as mBERT and XLM-R, where it achieves state-of-the-art results and improves over strong code-switching baselines. The approach improves cross-lingual representation alignment and offers a practical way to use code-switching data for zero-shot transfer across languages and tasks.

Abstract

Code-switching is a data augmentation scheme that mixes words from multiple languages into source-language text. By aligning cross-lingual contextual word representations, it has achieved considerable generalization performance on cross-lingual transfer tasks. However, uncontrolled and excessive replacement introduces noisy samples into model training; in other words, over-switched text hurts the model's cross-lingual transferability. To this end, we propose a Progressive Code-Switching (PCS) method that gradually generates moderately difficult code-switching examples for the model to discriminate, from easy to hard. The idea is to progressively use the multilingual knowledge learned from easier code-switching data to guide model optimization on subsequent, harder code-switching data. Specifically, we first design a difficulty measurer that measures the impact of replacing each word in a sentence based on its word relevance score. Then a code-switcher generates code-switching data of increasing difficulty via a controllable temperature variable. In addition, a training scheduler decides when to sample harder code-switching data for model training. Experiments show that our model achieves state-of-the-art results on three different zero-shot cross-lingual transfer tasks across ten languages.
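
To make the role of the temperature variable concrete, the following is a minimal sketch of a temperature-controlled code-switcher. It is not the paper's formulation, which this summary does not reproduce. It assumes that relevance scores are already available as plain floats (the paper derives them with LRP), that low-relevance words are the easier ones to replace, that replacement positions are drawn from a softmax over negated relevance, and that a toy bilingual dictionary supplies the translations. The names `replacement_weights`, `code_switch`, and `ratio` are hypothetical.

```python
import math
import random


def replacement_weights(relevance, temperature):
    """Softmax over negated relevance scores (an assumed scheme).

    At a low temperature the weight mass concentrates on low-relevance
    words; as the temperature grows the weights approach uniform, so
    high-relevance words are replaced more often.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    logits = [-r / temperature for r in relevance]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def code_switch(tokens, relevance, dictionary, temperature, ratio, rng):
    """Replace roughly `ratio` of the tokens with dictionary translations.

    Positions are sampled without replacement, weighted by
    `replacement_weights`. Tokens missing from the dictionary are kept.
    """
    weights = replacement_weights(relevance, temperature)
    pool = [i for i, tok in enumerate(tokens) if tok in dictionary]
    k = min(len(pool), round(ratio * len(tokens)))
    chosen = set()
    for _ in range(k):
        # A tiny epsilon keeps the total weight positive if values underflow.
        w = [weights[i] + 1e-12 for i in pool]
        pick = rng.choices(pool, weights=w, k=1)[0]
        chosen.add(pick)
        pool.remove(pick)
    return [dictionary[t] if i in chosen else t for i, t in enumerate(tokens)]


if __name__ == "__main__":
    tokens = ["the", "movie", "was", "great"]
    scores = [0.1, 0.6, 0.2, 0.9]  # made-up relevance scores
    en_de = {"the": "der", "movie": "Film", "was": "war", "great": "großartig"}
    rng = random.Random(0)
    for tau in (0.05, 1.0, 10.0):
        print(tau, code_switch(tokens, scores, en_de, tau, 0.5, rng))
```

Under these assumptions, raising the temperature is what makes the generated data harder: the sampler stops favouring words that contribute little to the prediction and starts replacing the words the model relies on.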

Paper Structure (33 sections, 6 equations, 5 figures, 6 tables)

Introduction
Related Work
Zero-shot cross-lingual transfer
Curriculum learning
Progressive Code-Switching
Problem Formulation.
Difficulty Measurer
Code-Switcher
Scheduler
Model Trainer
Experiments
Setup
Tasks and Datasets.
Implementation Details.
Performance Comparison
...and 18 more sections

Figures (5)

Figure 1: Illustration of our progressive code-switching cross-lingual idea. (a) Direct transfer from source to target. (b) Randomly generating code-switching data. (c) The proposed progressive code-switching method generates code-switching data for the model to discriminate from easy to hard. Larger and darker dots indicate harder code-switching data.
Figure 2: The left subfigure provides an overview of our proposed progressive code-switching framework, while the right subfigure illustrates the three key components. (i) The difficulty measurer calculates the relevance scores to estimate the contribution of each word in the source language data towards the prediction; (ii) The code-switcher selects substitution words based on the relevance score to generate suitable code-switching data; (iii) The scheduler decides when to sample harder code-switching examples for model training. $D_{EN}$: the labelled data in the source language; $D^{(k)}_{CS}$: the generated code-switching data in the $k$-th curriculum; $M^{(k)}$: the learned model for target languages in the $k$-th curriculum.
Figure 3: A darker colour indicates a higher cosine similarity score between source words in the original sentence and corresponding target words in the code-switching sentence.
Figure 4: Learning curves of our PCS and three baseline models on PAWS-X based on mBERT.
Figure 5: Multilingual alignment t-SNE visualization. Sentence embeddings from fine-tuned mBERT and our PCS.
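
The scheduler in Figure 2 decides when training moves from the $k$-th curriculum to a harder one. The summary does not state the paper's actual criterion, so the sketch below substitutes a simple plateau rule as an assumption: advance to the next temperature once a development score has failed to improve for `patience` consecutive evaluations. The class name `PlateauScheduler` and its parameters are hypothetical.

```python
class PlateauScheduler:
    """Move to the next (harder) curriculum stage when progress stalls.

    Assumed rule, not the paper's: a stage ends after `patience`
    evaluations in a row without a new best development score.
    """

    def __init__(self, temperatures, patience=2):
        if not temperatures:
            raise ValueError("need at least one temperature")
        self.temperatures = list(temperatures)  # one entry per curriculum stage
        self.patience = patience
        self.stage = 0
        self.best = float("-inf")
        self.stale = 0

    @property
    def temperature(self):
        """Temperature the code-switcher should use for the current stage."""
        return self.temperatures[self.stage]

    def step(self, dev_score):
        """Record one evaluation; return True if the stage just advanced."""
        if dev_score > self.best:
            self.best = dev_score
            self.stale = 0
            return False
        self.stale += 1
        last_stage = self.stage == len(self.temperatures) - 1
        if self.stale >= self.patience and not last_stage:
            self.stage += 1
            self.stale = 0
            # Harder data may lower the score, so track the best per stage.
            self.best = float("-inf")
            return True
        return False
```

In a training loop, one would call `step` after each evaluation and regenerate the code-switching data with the new `temperature` whenever it returns `True`, continuing from the current model weights rather than restarting.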
