Scaling Laws for Optimal Data Mixtures

Mustafa Shukor, Louis Bethune, Dan Busbridge, David Grangier, Enrico Fini, Alaaeldin El-Nouby, Pierre Ablin

TL;DR

This work introduces data mixture scaling laws that predict model loss as a function of model size $N$, training tokens $D$, and domain weights $h$ across LLM, NMM, and LVM pretraining. It develops additive and joint formulations, fits them from small-scale runs, and demonstrates accurate extrapolation to large-scale settings and unseen mixtures. By minimizing the fitted law over the simplex with mirror descent, the authors derive optimal domain weights that improve performance relative to naive mixtures, enabling principled data mixture selection under compute budgets. The approach offers a scalable, data-efficient pathway to optimize cross-domain pretraining, with practical implications for faster, cheaper, and more effective foundation models.
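The simplex-constrained minimization mentioned above can be illustrated with entropic mirror descent (also called exponentiated gradient). This is a minimal sketch under stated assumptions, not the authors' implementation: the fitted scaling law is not reproduced in this summary, so `toy_loss`, `toy_grad`, and `TOY_COEFFS` are hypothetical stand-ins, and the step size and iteration count are arbitrary illustrative choices.

```python
import math

# Hypothetical stand-in for a fitted law: L(h) = sum_i c_i / h_i.
# The constants below are illustrative and are NOT taken from the paper.
TOY_COEFFS = [1.0, 4.0, 9.0]


def toy_loss(h):
    """Toy surrogate loss over domain weights h (all entries must be > 0)."""
    return sum(c / hi for c, hi in zip(TOY_COEFFS, h))


def toy_grad(h):
    """Gradient of toy_loss: dL/dh_i = -c_i / h_i**2."""
    return [-c / (hi * hi) for c, hi in zip(TOY_COEFFS, h)]


def mirror_descent_simplex(grad_fn, h0, step_size=0.01, num_steps=500):
    """Entropic mirror descent (exponentiated gradient) on the simplex."""
    h = list(h0)
    for _ in range(num_steps):
        g = grad_fn(h)
        # Shifting by min(g) leaves the normalized update unchanged and
        # keeps every exponent <= 0, so exp() cannot overflow.
        g_min = min(g)
        w = [hi * math.exp(-step_size * (gi - g_min)) for hi, gi in zip(h, g)]
        total = sum(w)
        h = [wi / total for wi in w]  # renormalize back onto the simplex
    return h


if __name__ == "__main__":
    h_star = mirror_descent_simplex(toy_grad, [1 / 3, 1 / 3, 1 / 3])
    print("weights:", h_star, "loss:", toy_loss(h_star))
```

For this toy loss the constrained minimizer is known in closed form, $h_i \propto \sqrt{c_i}$, which gives $(1/6, 1/3, 1/2)$ and serves as a sanity check. The multiplicative update keeps every weight positive and the weights summing to one, so no explicit projection step is needed.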

Abstract

Large foundation models are typically trained on data from multiple domains, with the data mixture (the proportion of each domain used) playing a critical role in model performance. The standard approach to selecting this mixture relies on trial and error, which becomes impractical for large-scale pretraining. We propose a systematic method to determine the optimal data mixture for any target domain using scaling laws. Our approach accurately predicts the loss of a model of size $N$ trained with $D$ tokens and a specific domain weight vector $h$. We validate the universality of these scaling laws by demonstrating their predictive power in three distinct, large-scale settings: large language model (LLM), native multimodal model (NMM), and large vision model (LVM) pretraining. We further show that these scaling laws extrapolate to new data mixtures and across scales: their parameters can be accurately estimated from a few small-scale training runs and then used to estimate performance at larger scales and for unseen domain weights. The scaling laws make it possible to derive the optimal domain weights for any target domain under a given training budget $(N, D)$, providing a principled alternative to costly trial-and-error methods.

Paper Structure

This paper contains 34 sections, 19 equations, 10 figures, 15 tables.

Figures (10)

  • Figure 1: Scaling Laws for Optimal Data Mixtures. Left: We derive scaling laws that predict the loss of a model as a function of model size $N$, number of training tokens $D$, and the domain weights used to train the model (represented by the color of each point). The scaling law is fitted on small-scale runs with different domain weights and used to accurately predict the loss of large-scale models trained with new, unseen domain weights. Right: We fit the data mixture scaling law on small-scale experiments (e.g., below 1B parameters) and use it to predict the optimal data mixture at larger scales (e.g., 8B parameters). Both our additive (\ref{eq:first_scaling_law}) and joint (\ref{eq:joint_scaling_law}) laws lead to similar performance, and both outperform the other mixtures (in the gray area). FLOPs are computed as $6ND$.
  • Figure 2: Value of the Huber loss (\ref{eq:huber_loss}) as a function of the number of L-BFGS calls used to fit \ref{eq:joint_scaling_law} on the Interleaved domain from the multimodal experiment ($p=1062$ input-target pairs, $k=3$ domains). We repeat 100 random trials; the bold line is the median, and the shaded region spans the 25% to 75% quantile range. The basin-hopping method with an L-BFGS subroutine converges faster than repeated calls to L-BFGS.
  • Figure 3: Observed vs. predicted loss for LLM pretraining on domains from the SlimPajama dataset, NMM pretraining with multimodal domains, and LVM pretraining with image-caption domains. The scaling laws are fitted on small-scale models (blue points in the figure) and extrapolated to larger models. We display here the average loss over all domains for each modality. The MRE% for each domain is reported in \ref{tab:mre_results}.
  • Figure 4: Losses of the 7B models. After fitting the scaling laws on the small-scale runs, we estimate the optimal domain weights $h^*_{avg}$ that minimize the average loss over the training domains (left), and $h^*_{OH}$ that minimize the loss on the OpenHermes dataset (right). We then train 7B models with these optimal weights and compare them to two baselines: one with uniform weights, and one with the standard SlimPajama distribution. The losses are averaged over all training domains and also reported on the OpenHermes dataset. As expected, the model trained with $h^*_{OH}$ performs best on OpenHermes, while the model trained with $h^*_{avg}$ performs best on the training domains.
  • Figure 5: Evolution of optimal domain weights $h^*$ with compute budget $(N,D)$ on the multimodal data, as predicted by the joint scaling law (\ref{eq:joint_scaling_law}).
  • ...and 5 more figures
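
Two quantities named in the captions above, the $6ND$ compute estimate (Figure 1) and the Huber loss used as a fitting objective (Figure 2), can be written down in a few lines. This is a minimal sketch under stated assumptions: the function names, the threshold `delta`, and the token count in the usage example are hypothetical, and the summary does not say whether the paper applies the Huber loss to raw or log-transformed residuals, so raw residuals are assumed here.

```python
def training_flops(num_params, num_tokens):
    """Approximate training compute with the C = 6 * N * D rule."""
    return 6.0 * num_params * num_tokens


def huber(residual, delta=1e-3):
    """Standard Huber loss: quadratic for |r| <= delta, linear beyond it."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * residual * residual
    return delta * (a - 0.5 * delta)


def fit_objective(predicted, observed, delta=1e-3):
    """Sum of Huber losses over (prediction, observation) pairs.

    Assumption: residuals are taken on raw loss values; the paper's exact
    objective may differ.
    """
    return sum(huber(p - o, delta) for p, o in zip(predicted, observed))


if __name__ == "__main__":
    # 8B parameters as in the Figure 1 caption; the token count is a
    # made-up illustrative value.
    print("FLOPs:", training_flops(8e9, 1e11))
    print("objective:", fit_objective([2.10, 1.95], [2.08, 1.99]))
```

The Huber loss behaves like a squared error for small residuals and like an absolute error for large ones, which makes a fit less sensitive to a few outlier runs than plain least squares.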