BiMix: A Bivariate Data Mixing Law for Language Model Pretraining
Ce Ge, Zhijian Ma, Daoyuan Chen, Yaliang Li, Bolin Ding
TL;DR
BiMix is a bivariate data mixing law that jointly models domain proportions and training volume to predict per-domain losses during LLM pretraining. The law is fit with only five coefficients per domain, extrapolates losses and generalizes to unseen mixtures with high fidelity, and enables direct optimization of domain proportions. The work further shows that entropy-based proxies provide an efficient, training-free way to construct effective data mixtures. Empirically, BiMix outperforms existing methods in convergence speed and downstream performance on large datasets, offering a practical, scalable tool for data-centric language model scaling.
Abstract
Large language models have demonstrated remarkable capabilities across various tasks, primarily attributed to the use of diversely sourced data. However, the impact of pretraining data composition on model performance remains poorly understood. This paper introduces $\textbf{BiMix}$, a novel bivariate data mixing law that models the joint scaling behavior of domain proportions and data volume in LLM pretraining. $\textbf{BiMix}$ provides a systematic framework for understanding and optimizing data mixtures across diverse domains. Through extensive experiments on two large-scale datasets, we demonstrate $\textbf{BiMix}$'s high accuracy in loss extrapolation (mean relative error < 0.2%) and its generalization to unseen mixtures ($R^{2}$ > 0.97). Optimization of domain proportions yields superior model performance compared to existing methods. Furthermore, we establish entropy-based measures as efficient proxies for data mixing, offering a computationally lightweight strategy. Our work contributes both theoretical insights into data mixing dynamics and practical tools for enhancing LLM training efficiency, paving the way for more effective scaling strategies in language model development.
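The two fidelity figures quoted above, mean relative error and $R^{2}$, can be made concrete with a minimal sketch. It assumes the standard textbook definitions of both metrics; the abstract does not spell out the exact evaluation protocol, and the function names and loss values below are hypothetical illustrations, not results from the paper.

```python
from typing import Sequence


def mean_relative_error(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Average of |predicted - observed| / |observed| over all points."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - o) / abs(o) for o, p in zip(observed, predicted))
    return total / len(observed)


def r_squared(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values must not all be identical")
    return 1.0 - ss_res / ss_tot


# Hypothetical per-domain losses for illustration only; not from the paper.
observed_losses = [3.10, 2.85, 2.60, 2.42]
predicted_losses = [3.11, 2.84, 2.61, 2.42]

print(f"MRE: {mean_relative_error(observed_losses, predicted_losses):.4%}")
print(f"R^2: {r_squared(observed_losses, predicted_losses):.4f}")
```

Under these definitions, a mean relative error below 0.2% means the predicted losses deviate from the observed ones by less than 0.002 of their value on average, and an $R^{2}$ above 0.97 means the fitted law accounts for more than 97% of the variance in observed losses.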
