ShapleyLaw: A Game-Theoretic Approach to Multilingual Scaling Laws

Xuyang Cao, Qianying Liu, Chuan Xiao, Yusuke Oda, Pontus Stenetorp, Daisuke Kawahara, Makoto Onizuka, Sadao Kurohashi, Shuyuan Zheng

Abstract

In multilingual pretraining, the test loss of a pretrained model is heavily influenced by the proportion of each language in the pretraining data, namely the language mixture ratios. Multilingual scaling laws can predict the test loss under different language mixture ratios and can therefore be used to estimate the optimal ratios. However, the current approaches to multilingual scaling laws do not measure the cross-lingual transfer effect, resulting in suboptimal mixture ratios. In this paper, we consider multilingual pretraining as a cooperative game in which each language acts as a player that jointly contributes to pretraining, gaining the resulting reduction in test loss as the payoff. Consequently, from the perspective of cooperative game theory, we quantify the cross-lingual transfer from each language by its contribution in the game, and propose a game-theoretic multilingual scaling law called ShapleyLaw. Our experiments show that ShapleyLaw outperforms baseline methods in model performance prediction and language mixture optimization.
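
To make the cooperative-game framing concrete, the following is a minimal sketch of the textbook exact Shapley value computation on a toy three-language game. It is not the paper's estimation procedure: the function name `shapley_values`, the `TOY_PAYOFF` table, and every number in it are hypothetical, chosen only to illustrate how a language's contribution to the loss reduction on one target language could be attributed.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, payoff):
    """Exact Shapley values for a small cooperative game.

    players: list of hashable player ids (here, language codes).
    payoff:  function mapping a frozenset of players to a float.
    """
    n = len(players)
    values = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of p when joining coalition s.
                total += weight * (payoff(s | {p}) - payoff(s))
        values[p] = total
    return values


# Hypothetical payoffs: reduction in test loss on a Japanese test set
# when pretraining on each coalition of languages (illustrative numbers only).
TOY_PAYOFF = {
    frozenset(): 0.0,
    frozenset({"ja"}): 1.0,
    frozenset({"zh"}): 0.4,
    frozenset({"es"}): 0.1,
    frozenset({"ja", "zh"}): 1.3,
    frozenset({"ja", "es"}): 1.05,
    frozenset({"zh", "es"}): 0.45,
    frozenset({"ja", "zh", "es"}): 1.35,
}


def toy_payoff(coalition):
    return TOY_PAYOFF[frozenset(coalition)]


phi = shapley_values(["ja", "zh", "es"], toy_payoff)
for lang, value in phi.items():
    print(f"{lang}: {value:.4f}")
```

In this toy game the values sum to the payoff of the full coalition (the efficiency property of the Shapley value), and the target language itself receives the largest share. Exact enumeration visits every coalition, so it is only practical for a handful of languages.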

Paper Structure (54 sections, 28 equations, 11 figures, 6 tables, 1 algorithm)

Figures (11)

  • Figure 1: (a) Cross-lingual transfer effects quantified by normalized SVs $\phi^{NSV}_{i,j}$ estimated from small-scale models ($N=50$M, $D=50$B), shown as an asymmetric matrix. In contrast, the FamilyLaw baseline (He et al., 2025) yields zero off-diagonal entries. (b) Across diverse pre-training mixtures and model scales $N\in\{50\text{M},243\text{M},684\text{M},1460\text{M}\}$ with total data size $D=100$B, the proposed ShapleyLaw $\mathcal{L}_{j}(N,D,\Theta_{ja})$ accurately fits and predicts Japanese performance.
  • Figure 2: Consider pretraining two models for a Chinese task with training data mixtures in Chinese (zh), Japanese (ja), and Spanish (es), which belong to different language families. We first pretrain Model 1 by sampling Chinese, Spanish, and Japanese data with ratios $[0.500,0.495,0.005]$, respectively. We then swap the mixture ratios of Japanese and Spanish to pretrain Model 2. Although Models 1 and 2 share the same model size $N=684$M and data size $D=100$B, they exhibit significantly different losses on the Chinese task.
  • Figure 3: Fitting ShapleyLaw across multilingual mixtures and target languages. Top row: fits with respect to aggregate cross-lingual transfer $\Theta$ on curated multilingual mixtures for Chinese (ZH), Spanish (ES), and Japanese (JA) across four model scales $\{50\text{M}, 243\text{M}, 684\text{M}, 1460\text{M}\}$; circles denote fitting points and yellow diamonds denote held-out evaluation points. Bottom row: fits under randomly sampled mixtures at model scale $N=243$M; circles indicate fitting points, yellow diamonds indicate held-out points, and orange crosses denote removed outliers.
  • Figure 4: Controlled comparison between FamilyLaw and ShapleyLaw on a Chinese evaluation task. We pretrain multilingual models with $N=50$M and $D=100$B using Chinese, Japanese, and Spanish corpora. Starting from the mixture $[0.05, 0.55, 0.40]$, we vary the Spanish--Japanese proportions up to $[0.05, 0.945, 0.05]$ while keeping the Chinese ratio fixed. Performance is evaluated on the Chinese test set. Left: FamilyLaw, with the language ratio on the x-axis, yields a poor fit. Right: ShapleyLaw, with the Chinese aggregate transfer $\Theta$ on the x-axis, fits accurately.
  • Figure 5: Fitting Relationship between Downstream Performance and Test CE Loss. Experiments are conducted on five downstream tasks for English, German, French, and Spanish. Across all tasks, we observe strong negative correlations ($|r| > 0.90$, with an average $R^{2}=0.931$), indicating that lower test loss consistently corresponds to higher downstream performance.
  • ...and 6 more figures

Theorems & Definitions (1)

  • Definition 3.1: Multilingual Pretraining Game (MPG)