A Post-trainer's Guide to Multilingual Training Data: Uncovering Cross-lingual Transfer Dynamics

Luisa Shimabucoro, Ahmet Ustun, Marzieh Fadaee, Sebastian Ruder

TL;DR

This study investigates cross-lingual transfer (CLT) dynamics during multilingual instruction tuning for large language models by systematically varying task type, training setting, model size, and the amount of multilingual data across 12 languages. It uses two model families of up to 35B parameters and three tasks (summarization, instruction following, and mathematical reasoning) with diverse data mixtures, evaluated on both seen and unseen languages. The findings show that cross-lingual transfer cannot be explained by a single factor: task type, model scale, and training setting interact to shape CLT. Larger models exhibit more efficient transfer and smaller seen–unseen gaps, while multi-task training can introduce instability. The work provides practical guidelines on data mixture strategies and language selection for realistic post-training regimes, with implications for deploying multilingual LLMs in diverse linguistic contexts.

Abstract

In order for large language models to be useful across the globe, they are fine-tuned to follow instructions on multilingual data. Despite the ubiquity of such post-training, a clear understanding of the dynamics that enable cross-lingual transfer remains elusive. This study examines cross-lingual transfer (CLT) dynamics in realistic post-training settings. We study two model families of up to 35B parameters in size trained on carefully controlled mixtures of multilingual data on three generative tasks with varying levels of complexity (summarization, instruction following, and mathematical reasoning) in both single-task and multi-task instruction tuning settings. Overall, we find that the dynamics of cross-lingual transfer and multilingual performance cannot be explained by isolated variables, varying depending on the combination of post-training settings. Finally, we identify the conditions that lead to effective cross-lingual transfer in practice.

Paper Structure

This paper contains 28 sections, 12 figures, 23 tables.

Figures (12)

  • Figure 1: Overview of our experimental framework. We investigate cross-lingual transfer performance improvements during the instruction tuning stage by varying: 1) task type; 2) fine-tuning setting (single-task or multi-task); 3) quantity of multilingual data; and 4) model size. Runs use a fixed amount of English data and increasing amounts of multilingual data.
  • Figure 2: Average performance across seen and unseen languages relative to English across different tasks for 7B (left) and 35B (right) base models individually (single-task) trained on instruction following (IF), summarization (SM) and mathematical reasoning (MR) datasets. The x-axis indicates the number of samples per non-English language seen during training (es, fr, zh, ja). 7B (left): While IF and summarization results plateau after adding as little as 400 non-English samples per language, MR requires roughly 13x more multilingual data to reach peak performance. 35B (right): We observe similar plateauing behavior as in the smaller model after 200–400 samples across all tasks.
  • Figure 3: Performance changes across runs with gradually increasing amounts of non-English data for 7B and 35B models instruction tuned on a single-task setting on IF, SM and MR tasks, respectively.
  • Figure 4: Average performance across seen and unseen languages relative to English across different tasks for 7B (left) and 35B (right) base models trained jointly (multi-task) on instruction following (IF), summarization (SM) and mathematical reasoning (MR) datasets. The x-axis indicates the number of samples per non-English language seen during training (es, fr, zh, ja). 7B (left): While IF and summarization results reach an average of 90% performance relative to English, MR improves erratically, reaching only a little over 60% relative performance, a decrease of over 10% compared to the single-task results in Figure 2. 35B (right): We observe similar plateauing behavior as in the smaller model after 200–400 samples across all tasks, with very similar performance relative to English for all tasks; the spread across tasks seems to narrow even further in the multi-task setting when compared to the numbers in Figure 2.
  • Figure 5: Performance changes across runs with gradually increasing amounts of non-English data for 7B and 35B models instruction tuned on a multi-task setting on IF, SM and MR tasks, respectively.
  • ...and 7 more figures
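
The setup in Figure 1 (a fixed amount of English data plus increasing amounts of multilingual data) and the "relative to English" metric plotted in Figures 2 and 4 can be sketched as below. This is a minimal illustration, not the authors' code. The function names, the sample counts, the scores, and the exact normalization (average non-English score divided by the English score) are all assumptions made for the example; only the seen-language list (es, fr, zh, ja) comes from the captions.

```python
from statistics import mean

# Seen non-English languages, as listed in the Figure 2 and Figure 4 captions.
SEEN_LANGS = ["es", "fr", "zh", "ja"]


def build_mixture(n_english, n_per_language, languages=SEEN_LANGS):
    """Return the sample count per language for one training run.

    English is held fixed while every non-English language receives the
    same number of samples (hypothetical helper, for illustration only).
    """
    mixture = {"en": n_english}
    for lang in languages:
        mixture[lang] = n_per_language
    return mixture


def relative_to_english(scores, languages):
    """Average score over `languages` as a percentage of the English score.

    Assumed definition; the paper's exact normalization may differ.
    """
    avg = mean(scores[lang] for lang in languages)
    return 100.0 * avg / scores["en"]


# Hypothetical sweep: English fixed, non-English samples per language grow.
runs = [build_mixture(10_000, n) for n in (0, 100, 200, 400)]

# Hypothetical per-language scores for a single run (not results from the paper).
scores = {"en": 0.80, "es": 0.72, "fr": 0.76, "zh": 0.68, "ja": 0.64}
seen_relative = relative_to_english(scores, SEEN_LANGS)
```

Under this assumed definition, a value near 100 means the averaged languages perform about as well as English. The same function can be applied to a list of unseen languages to estimate the seen–unseen gap.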