Measuring Cross-lingual Transfer in Bytes

Leandro Rodrigues de Souza; Thales Sales Almeida; Roberto Lotufo; Rodrigo Nogueira

TL;DR

This work introduces a Data Transfer metric $D_T$ that quantifies cross-lingual knowledge transfer in bytes, using decoder-only Transformers with a byte-level vocabulary. By pretraining on one language and finetuning on another, it shows that transfer magnitudes are broadly similar across diverse source languages, supporting the existence of language-agnostic representations. The study finds only weak evidence that language contamination or linguistic proximity drives transfer, and it reveals non-commutative transfer patterns that are likely due to dataset heterogeneity. These findings offer a principled way to measure language-agnostic knowledge in pretraining and suggest avenues for broader, more controlled evaluations across more languages and tasks.

Abstract

Multilingual pretraining has been a successful solution to the challenges posed by the lack of resources for languages. These models can transfer knowledge to target languages with minimal or no examples. Recent research suggests that monolingual models also have a similar capability, but the mechanisms behind this transfer remain unclear. Some studies have explored factors like language contamination and syntactic similarity. An emerging line of research suggests that the representations learned by language models contain two components: a language-specific component and a language-agnostic component. The latter is responsible for transferring more universal knowledge. However, there is a lack of comprehensive exploration of these properties across diverse target languages. To investigate this hypothesis, we conducted an experiment inspired by the work on the Scaling Laws for Transfer. We measured the amount of data transferred from a source language to a target language and found that models initialized from diverse languages perform similarly on a target language in a cross-lingual setting. This was surprising because the amount of data transferred to 10 diverse target languages, such as Spanish, Korean, and Finnish, was quite similar. We also found evidence that this transfer is not related to language contamination or language proximity, which strengthens the hypothesis that the model also relies on language-agnostic knowledge. Our experiments have opened up new possibilities for measuring how much data represents the language-agnostic representations learned during pretraining.

Paper Structure (25 sections, 2 equations, 4 figures, 7 tables)

Introduction
Related Work
Methodology
Data Transfer Estimation
Task and Evaluation Metric
Tokenization Impact
Language Contamination
Language Similarity
Experiments
Languages
Datasets
Model Architecture
Training details
Results
Performance with different initializations
...and 10 more sections

Figures (4)

Figure 1: Example illustrating how the coefficients $D_T$, $D_F$, and $D_E$ are calculated. Each series represents a different initialization. $D_T$ is the number of additional tokens in the target language that a from-scratch model would have needed to achieve the same perplexity as a model finetuned from English. $D_F$ is the size of the dataset used for finetuning, and $D_E$ accounts for all data, both $D_F$ and $D_T$.
Figure 2: Results measured in Perplexity per token for three target languages. Each series represents a different initialization: train from scratch, finetune from an English, Chinese, or Russian model.
Figure 3: Dispersion chart for Data Transfer ($D_T$) across target languages. Each series corresponds to a distinct source language. The first dashed line (top-to-bottom) indicates the average of the best results (higher transfer), while the second one represents the average of the worst results (lower transfer).
Figure 4: Boxplot with Data Transfer results for the 6-million-token datasets in all target languages.
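
The relationship described in the Figure 1 caption, where $D_E$ covers both $D_F$ and $D_T$, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' procedure: it assumes $D_E$ is read off the from-scratch curve by log-log interpolation between measured points, and the function names `effective_data` and `data_transfer` as well as all numbers are hypothetical.

```python
import math


def effective_data(scratch_sizes, scratch_ppl, finetuned_ppl):
    """Estimate D_E: the dataset size at which a from-scratch model
    would reach `finetuned_ppl`.

    Assumption: we interpolate linearly in log-log space between the
    measured from-scratch points (dataset size, perplexity).
    """
    # Sort by dataset size; perplexity is assumed to fall as size grows.
    points = sorted(zip(scratch_sizes, scratch_ppl))
    for (d_lo, p_lo), (d_hi, p_hi) in zip(points, points[1:]):
        if p_hi <= finetuned_ppl <= p_lo:
            if p_lo == p_hi:
                return float(d_lo)
            # Fraction of the way from (d_lo, p_lo) to (d_hi, p_hi) in log space.
            t = (math.log(p_lo) - math.log(finetuned_ppl)) / (
                math.log(p_lo) - math.log(p_hi)
            )
            return math.exp(math.log(d_lo) + t * (math.log(d_hi) - math.log(d_lo)))
    raise ValueError("finetuned perplexity lies outside the from-scratch curve")


def data_transfer(scratch_sizes, scratch_ppl, finetuned_ppl, d_f):
    """D_T = D_E - D_F: the extra target-language tokens a from-scratch
    model would need to match the finetuned model."""
    return effective_data(scratch_sizes, scratch_ppl, finetuned_ppl) - d_f


# Hypothetical from-scratch curve (illustrative numbers, not from the paper).
sizes = [1e6, 1e7, 1e8]   # target-language tokens
ppl = [8.0, 4.0, 2.0]     # perplexity reached at each size

# Hypothetical finetuned model: trained on D_F = 1e6 tokens, perplexity 4.0.
d_t = data_transfer(sizes, ppl, finetuned_ppl=4.0, d_f=1e6)
print(f"Estimated D_T: {d_t:.3e} tokens")
```

The sketch raises an error instead of extrapolating when the finetuned perplexity falls outside the measured from-scratch range, since any value reported there would depend on a fitted curve that this example does not model.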
