A Post-trainer's Guide to Multilingual Training Data: Uncovering Cross-lingual Transfer Dynamics
Luisa Shimabucoro, Ahmet Ustun, Marzieh Fadaee, Sebastian Ruder
TL;DR
This study investigates cross-lingual transfer (CLT) dynamics during multilingual instruction tuning of large language models by systematically varying task type, training setting, model size, and multilingual data across 12 languages. It uses two model families of up to 35B parameters and three tasks (summarization, instruction following, and mathematical reasoning) with diverse data mixtures, including seen and unseen languages. The findings show that CLT cannot be explained by a single factor: task type, model scale, and training setting interact to shape it. Larger models exhibit more efficient transfer and smaller gaps between seen and unseen languages, while multi-task training can introduce instability. The work provides practical guidelines on data mixture strategies and language selection for optimizing multilingual transfer in realistic post-training regimes, with implications for deploying multilingual LLMs in diverse linguistic contexts.
Abstract
For large language models to be useful across the globe, they are fine-tuned to follow instructions on multilingual data. Despite the ubiquity of such post-training, a clear understanding of the dynamics that enable cross-lingual transfer remains elusive. This study examines cross-lingual transfer (CLT) dynamics in realistic post-training settings. We study two model families of up to 35B parameters, trained on carefully controlled mixtures of multilingual data, on three generative tasks of varying complexity (summarization, instruction following, and mathematical reasoning) in both single-task and multi-task instruction tuning settings. Overall, we find that the dynamics of cross-lingual transfer and multilingual performance cannot be explained by isolated variables; they vary depending on the combination of post-training settings. Finally, we identify the conditions that lead to effective cross-lingual transfer in practice.
