Investigating and Scaling up Code-Switching for Multilingual Language Model Pre-Training
Zhijun Wang, Jiahuan Li, Hao Zhou, Rongxiang Weng, Jingang Wang, Xin Huang, Xue Han, Junlan Feng, Chao Deng, Shujian Huang
TL;DR
This study investigates why multilingual LLMs transfer knowledge across languages despite highly imbalanced pre-training data, and identifies code-switching (CS) in the pre-training corpus as a key driver. It develops a systematic way to measure CS, categorizes it into four types, and shows that natural CS significantly boosts cross-lingual transfer. Because natural CS is scarce, the authors introduce SynCS, a scalable synthetic code-switching framework that injects CS data with controllable density and format, yielding up to 20x efficiency gains over equivalent monolingual data and improved multilingual alignment. The approach generalizes to multiple languages and improves downstream tasks such as translation and zero-shot cross-lingual transfer. Its main limitations are that the experiments use a relatively small 1.5B-parameter model and that the results highlight data-quality considerations for low-resource languages.
Abstract
Large language models (LLMs) exhibit remarkable multilingual capabilities despite the extreme language imbalance in the pre-training data. In this paper, we closely examine the reasons behind this phenomenon, focusing on the pre-training corpus. We find that the existence of code-switching, alternating between different languages within a context, is key to multilingual capabilities. We conduct an analysis to investigate code-switching in the pre-training corpus, examining its presence and categorizing it into four types within two quadrants. We then assess its impact on multilingual performance. These types of code-switching data are unbalanced in proportions and demonstrate different effects on facilitating language transfer. To better explore the power of code-switching for language alignment during pre-training, we investigate the strategy of synthetic code-switching. We continuously scale up the synthetic code-switching data and observe remarkable improvements in both benchmarks and representation space. Extensive experiments indicate that incorporating synthetic code-switching data enables better language alignment and generalizes well to high-, medium-, and low-resource languages with pre-training corpora of varying qualities.
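To make the idea of synthetic code-switching with a controllable density concrete, here is a minimal Python sketch. It is not the paper's SynCS implementation: it assumes simple token-level substitution, and the names `TOY_LEXICON`, `synth_code_switch`, and `switched_fraction` are hypothetical, introduced only for illustration. A realistic pipeline would likely rely on a much larger lexicon or a translation model.

```python
import random

# Hypothetical toy English-to-Chinese lexicon (an assumption for illustration);
# a real pipeline would need a far larger dictionary or a translation model.
TOY_LEXICON = {
    "language": "语言",
    "model": "模型",
    "data": "数据",
    "training": "训练",
}


def synth_code_switch(tokens, lexicon, density, seed=0):
    """Replace a fraction `density` of the translatable tokens.

    `density` is measured over tokens that have a lexicon entry,
    not over all tokens in the sentence.
    """
    if not 0.0 <= density <= 1.0:
        raise ValueError("density must be in [0, 1]")
    rng = random.Random(seed)  # seeded so the output is reproducible
    eligible = [i for i, tok in enumerate(tokens) if tok.lower() in lexicon]
    n_switch = round(density * len(eligible))
    chosen = set(rng.sample(eligible, n_switch))
    return [
        lexicon[tok.lower()] if i in chosen else tok
        for i, tok in enumerate(tokens)
    ]


def switched_fraction(original, switched):
    """Share of token positions that differ after substitution."""
    if not original:
        return 0.0
    changed = sum(a != b for a, b in zip(original, switched))
    return changed / len(original)


sentence = "the language model needs more training data".split()
mixed = synth_code_switch(sentence, TOY_LEXICON, density=0.5)
print(" ".join(mixed))
print(switched_fraction(sentence, mixed))
```

With `density=0.5`, two of the four translatable tokens are swapped; which two depends on the seed. Setting `density=0.0` returns the sentence unchanged, and `density=1.0` replaces every token that has a lexicon entry.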
