Data Mixture Optimization: A Multi-fidelity Multi-scale Bayesian Framework

Thomson Yen, Andrew Wei Tung Siah, Haozhe Chen, Tianyi Peng, Daniel Guetta, Hongseok Namkoong

TL;DR

This work tackles data-mixture optimization for large language model pretraining by framing it as a probabilistic extrapolation problem that accounts for uncertainty in data composition, model scale, and training duration. It introduces Multi-Fidelity Multi-Scale Bayesian Optimization (MFMS-BO) and a simple MFMS-GP surrogate to efficiently search over data mixtures $\boldsymbol{w}$, model size $m$, and training steps $z$, using cost-aware acquisition functions such as expected improvement (EI) per unit cost. An empirical testbed built from 472 pretraining runs shows that MFMS-BO finds high-performing configurations 2.6x to 3.3x faster than baselines such as Hyperband and Random Search, with smaller models and partial training runs providing valuable predictive signals. The results suggest substantial efficiency gains from principled data-mixture optimization and point toward future extensions to broader data-selection tasks and to more sophisticated kernels and acquisition strategies.
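To make the idea of a cost-aware acquisition concrete, the following is a minimal sketch of expected improvement per unit cost for a minimization objective such as validation loss. It is an illustration under stated assumptions, not the authors' implementation: the function names, the posterior means and standard deviations, and the cost proxy $m \cdot z$ (model size times training steps) are all hypothetical choices made for this example.

```python
import math
import numpy as np

def normal_pdf(u):
    """Standard normal density."""
    return np.exp(-0.5 * u**2) / math.sqrt(2.0 * math.pi)

def normal_cdf(u):
    """Standard normal CDF via the error function."""
    erf = np.vectorize(math.erf, otypes=[float])
    return 0.5 * (1.0 + erf(u / math.sqrt(2.0)))

def expected_improvement(mu, sigma, best):
    """EI for minimization: E[max(best - f, 0)] with f ~ N(mu, sigma^2)."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    safe_sigma = np.where(sigma > 0, sigma, 1.0)  # avoid division by zero
    u = (best - mu) / safe_sigma
    ei = (best - mu) * normal_cdf(u) + safe_sigma * normal_pdf(u)
    # With zero predictive variance the improvement is deterministic.
    return np.where(sigma > 0, ei, np.maximum(best - mu, 0.0))

def ei_per_unit_cost(mu, sigma, best, cost):
    """Divide EI by the (assumed) cost of running each candidate."""
    return expected_improvement(mu, sigma, best) / np.asarray(cost, dtype=float)

# Hypothetical surrogate predictions for three candidate runs (w, m, z).
mu = np.array([3.05, 2.98, 2.90])      # predicted loss (made-up values)
sigma = np.array([0.10, 0.08, 0.05])   # predictive std (made-up values)
# Assumed cost proxy: model size m times training steps z.
cost = np.array([20e6 * 1e3, 150e6 * 1e3, 1e9 * 1e3])
best_so_far = 3.00

scores = ei_per_unit_cost(mu, sigma, best_so_far, cost)
pick = int(np.argmax(scores))  # index of the candidate to run next
```

Dividing by cost lets a cheap, noisy experiment win over an expensive one when its expected improvement is not proportionally smaller, which is the trade-off the TL;DR describes.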

Abstract

Careful curation of data sources can significantly improve the performance of LLM pre-training, but predominant approaches rely heavily on intuition or costly trial-and-error, making them difficult to generalize across different data domains and downstream tasks. Although scaling laws can provide a principled and general approach for data curation, standard deterministic extrapolation from small-scale experiments to larger scales requires strong assumptions on the reliability of such extrapolation, whose brittleness has been highlighted in prior work. In this paper, we introduce a probabilistic extrapolation framework for data mixture optimization that avoids rigid assumptions and explicitly models the uncertainty in performance across decision variables. We formulate data curation as a sequential decision-making problem (multi-fidelity, multi-scale Bayesian optimization) where {data mixtures, model scale, training steps} are adaptively selected to balance training cost and potential information gain. Our framework naturally gives rise to algorithm prototypes that leverage noisy information from inexpensive experiments to systematically inform costly training decisions. To accelerate methodological progress, we build a simulator based on 472 language model pre-training runs with varying data compositions from the SlimPajama dataset. We observe that even simple kernels and acquisition functions can enable principled decisions across training models from 20M to 1B parameters and achieve 2.6x and 3.3x speedups compared to multi-fidelity BO and random search baselines, respectively. Taken together, our framework underscores potential efficiency gains achievable by developing principled and transferable data mixture optimization methods.

Paper Structure

This paper contains 18 sections, 2 equations, 7 figures, 3 tables, 2 algorithms.

Figures (7)

  • Figure 1: Left: The predicted validation cross-entropy loss on ArXiv data \cite{Shen_Slimpajama} as a function of data mixing coefficient and model size, from a data-driven predictor fitted on 472 runs (see details in Sec. \ref{sec:motivation}). Notice the highly non-smooth geometry. Orange dots highlight the optimal data mixture proportion for each model scale; note that they are not consistent across scales. Middle: The curvature at these points shows regions of high irregularity, suggesting that the relationship between data mixture and model performance is unlikely to take a simple functional form. Right: A demonstration of how functional forms like exponential decay, fitted on a small number of points, would result in a high predictive error. In contrast, a probabilistic model such as a Gaussian Process can capture uncertainty over the points.
  • Figure 2: Our multi-fidelity multi-scale Bayesian optimization framework. (a) Given an unknown optimal training data distribution, (b) existing methods use heuristic-based filtering techniques to approximate the optimal distribution. (c) Our algorithm treats data mixture optimization as a Bayesian Optimization problem. (d) We explore data mixtures in a cost-aware fashion; when we test a new data mixture, we also choose the fidelity of the observation we will collect. Larger models trained for more steps yield higher-fidelity observations but are more expensive. Every point we observe updates our probabilistic belief about model performance over the space of data mixtures, model sizes, and training steps, which guides the choice of subsequent parameters.
  • Figure 3: $R^2$ values of the experiments listed in Table \ref{model_size_exp_setup}, averaged over 3 random seeds. Notice that $E_2 > E_1$ and $E_5 > E_4$: our ability to predict the performance of larger models is considerably enhanced by insights from smaller models. Note also that $E_3 \approx E_2$; adding information about much smaller models does not seem to help.
  • Figure 3: On maximizing accuracy in the downstream tasks, our multi-scale multi-fidelity approach achieves more than 2.6x speedup and finds the best configuration the fastest.
  • Figure 4: On minimizing the validation cross-entropy losses, our multi-scale multi-fidelity approach achieves more than 2.7x speedup and finds the best configuration the fastest.
  • ...and 2 more figures
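
The right panel of Figure 1 contrasts a fitted parametric curve with a Gaussian Process that reports uncertainty. The sketch below illustrates that behaviour with a one-dimensional, zero-mean GP and an RBF kernel in plain NumPy. It is only an illustration: the kernel, its hyperparameters, the single mixing weight, and the loss values are assumptions made for this example and are not taken from the paper's MFMS-GP surrogate.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=0.3, variance=1.0):
    """Squared-exponential kernel between two sets of 1-D inputs."""
    a = np.asarray(a, dtype=float).reshape(-1, 1)
    b = np.asarray(b, dtype=float).reshape(1, -1)
    return variance * np.exp(-0.5 * ((a - b) / lengthscale) ** 2)

def gp_posterior(x_train, y_train, x_test, noise=1e-4):
    """Posterior mean and standard deviation of a zero-mean GP."""
    y_train = np.asarray(y_train, dtype=float)
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(len(y_train))
    k_st = rbf_kernel(x_test, x_train)
    k_ss = rbf_kernel(x_test, x_test)
    chol = np.linalg.cholesky(k_tt)
    # Solve (K + noise * I) alpha = y using the Cholesky factor.
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = k_st @ alpha
    v = np.linalg.solve(chol, k_st.T)
    var = np.diag(k_ss) - np.sum(v**2, axis=0)
    return mean, np.sqrt(np.maximum(var, 0.0))

# Made-up observations: loss at three mixing weights for one data domain.
w_obs = np.array([0.1, 0.2, 0.3])
loss_obs = np.array([3.10, 2.95, 2.90])
offset = loss_obs.mean()  # center targets because the GP prior is zero-mean

# Query one observed weight and one far from the data.
w_query = np.array([0.2, 0.9])
mean, std = gp_posterior(w_obs, loss_obs - offset, w_query)
mean = mean + offset
```

Here `std` is small at the observed weight and much larger at the distant one, so an acquisition function built on this posterior can tell a confident prediction apart from an extrapolation.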