Data Mixture Optimization: A Multi-fidelity Multi-scale Bayesian Framework

Thomson Yen; Andrew Wei Tung Siah; Haozhe Chen; Tianyi Peng; Daniel Guetta; Hongseok Namkoong

TL;DR

This work tackles data-mixture optimization for large language model pretraining by framing it as a probabilistic extrapolation problem that accounts for uncertainty in data composition, model scale, and training duration. It introduces Multi-Fidelity Multi-Scale Bayesian Optimization (MFMS-BO) and a simple MFMS-GP surrogate to efficiently search over data mixtures $\boldsymbol{w}$, model size $m$, and training steps $z$, using a cost-aware acquisition function such as EI per unit cost. An empirical testbed built from 472 pretraining runs demonstrates that MFMS-BO can find high-performing configurations 2.6x to 3.3x faster than baselines like Hyperband and Random Search, with smaller models and partial training runs providing valuable predictive signals. The results suggest substantial efficiency gains from principled data-mixture optimization and point toward future extensions to broader data-selection tasks and to more sophisticated kernels and acquisition strategies.

Abstract

Careful curation of data sources can significantly improve the performance of LLM pre-training, but predominant approaches rely heavily on intuition or costly trial-and-error, making them difficult to generalize across different data domains and downstream tasks. Although scaling laws can provide a principled and general approach for data curation, standard deterministic extrapolation from small-scale experiments to larger scales requires strong assumptions on the reliability of such extrapolation, whose brittleness has been highlighted in prior works. In this paper, we introduce a probabilistic extrapolation framework for data mixture optimization that avoids rigid assumptions and explicitly models the uncertainty in performance across decision variables. We formulate data curation as a sequential decision-making problem (multi-fidelity, multi-scale Bayesian optimization) in which {data mixtures, model scale, training steps} are adaptively selected to balance training cost and potential information gain. Our framework naturally gives rise to algorithm prototypes that leverage noisy information from inexpensive experiments to systematically inform costly training decisions. To accelerate methodological progress, we build a simulator based on 472 language model pre-training runs with varying data compositions from the SlimPajama dataset. We observe that even simple kernels and acquisition functions can enable principled decisions across training models from 20M to 1B parameters and achieve 2.6x and 3.3x speedups compared to multi-fidelity BO and random search baselines. Taken together, our framework underscores potential efficiency gains achievable by developing principled and transferable data mixture optimization methods.
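To make the idea of balancing training cost against potential information gain concrete, the following is a minimal sketch of an "expected improvement per unit cost" selection rule over candidate configurations $(\boldsymbol{w}, m, z)$. It is an illustration under stated assumptions, not the paper's implementation: the posterior means and standard deviations are made-up numbers standing in for a surrogate's predictions, the cost model (cost proportional to $m \times z$) is a hypothetical simplification, and the helper names (`expected_improvement`, `training_cost`, `pick_next`) are invented for this example. The sketch also ignores how a surrogate would relate low-fidelity runs to performance at the target scale.

```python
import math


def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def expected_improvement(mean, std, best):
    """EI for minimization: E[max(best - Y, 0)] with Y ~ N(mean, std^2)."""
    if std <= 0.0:
        return max(best - mean, 0.0)
    u = (best - mean) / std
    return (best - mean) * normal_cdf(u) + std * normal_pdf(u)


def training_cost(model_size, steps):
    """Hypothetical cost model: proportional to parameters times steps."""
    return model_size * steps


def pick_next(candidates, best):
    """Return the candidate with the highest EI per unit of assumed cost."""
    def score(c):
        ei = expected_improvement(c["mean"], c["std"], best)
        return ei / training_cost(c["m"], c["z"])
    return max(candidates, key=score)


# Illustrative candidates: w is a mixture over data sources, m is the
# parameter count, z is the number of training steps. The mean/std values
# are placeholders, not results from the paper.
candidates = [
    {"name": "small", "w": (0.5, 0.3, 0.2), "m": 20e6, "z": 1000,
     "mean": 3.0, "std": 0.2},
    {"name": "large", "w": (0.5, 0.3, 0.2), "m": 1e9, "z": 1000,
     "mean": 2.8, "std": 0.2},
]

chosen = pick_next(candidates, best=3.1)
print(chosen["name"])
```

Dividing by cost means an expensive run is selected only when its expected improvement is large enough to justify the extra compute, which is one simple way to let inexpensive experiments be explored first.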
