
BiMix: A Bivariate Data Mixing Law for Language Model Pretraining

Ce Ge, Zhijian Ma, Daoyuan Chen, Yaliang Li, Bolin Ding

TL;DR

BiMix proposes a bivariate data mixing law that jointly models domain proportions and training volume to predict per-domain losses during LLM pretraining. The framework fits with only five coefficients per domain and can extrapolate losses and generalize to unseen mixtures with high fidelity, while enabling direct optimization of domain proportions. It further shows that entropy-based proxies provide efficient, training-free means to construct effective data mixtures. Empirically, BiMix outperforms existing methods in convergence speed and downstream performance on large datasets, offering a practical, scalable tool for data-centric language model scaling.
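To make the "five coefficients per domain" claim concrete, the sketch below evaluates one functional form with exactly that many parameters: a power law in the domain proportion multiplied by a power law plus a constant in the number of training steps. This section does not state the actual equation, so the form itself, the function name `bivariate_loss`, and every coefficient value are illustrative assumptions rather than the paper's fitted law.

```python
def bivariate_loss(r, s, A, alpha, B, beta, C):
    """Hypothetical per-domain validation loss.

    r: domain proportion in (0, 1]
    s: number of training steps (> 0)
    A, alpha, B, beta, C: the five per-domain coefficients (assumed form).
    """
    if not (0.0 < r <= 1.0):
        raise ValueError("domain proportion r must lie in (0, 1]")
    if s <= 0:
        raise ValueError("training steps s must be positive")
    # Proportion term scales a steps term that decays toward the constant C.
    return (A / r ** alpha) * (B / s ** beta + C)


# Made-up coefficients, chosen only to show the qualitative behaviour.
coeffs = dict(A=2.0, alpha=0.1, B=5.0, beta=0.3, C=1.0)

loss_small_share = bivariate_loss(0.1, 1_000, **coeffs)
loss_large_share = bivariate_loss(0.4, 1_000, **coeffs)
loss_more_steps = bivariate_loss(0.4, 10_000, **coeffs)
```

Under these assumed positive exponents, the predicted loss for a domain falls both when its share of the mixture grows and when training runs longer, which is the joint behaviour a bivariate law is meant to capture.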

Abstract

Large language models have demonstrated remarkable capabilities across various tasks, primarily attributed to the utilization of diversely sourced data. However, the impact of pretraining data composition on model performance remains poorly understood. This paper introduces BiMix, a novel bivariate data mixing law that models the joint scaling behavior of domain proportions and data volume in LLM pretraining. BiMix provides a systematic framework for understanding and optimizing data mixtures across diverse domains. Through extensive experiments on two large-scale datasets, we demonstrate BiMix's high accuracy in loss extrapolation (mean relative error < 0.2%) and its generalization to unseen mixtures ($R^2$ > 0.97). Optimization of domain proportions yields superior model performance compared to existing methods. Furthermore, we establish entropy-based measures as efficient proxies for data mixing, offering a computationally lightweight strategy. Our work contributes both theoretical insights into data mixing dynamics and practical tools for enhancing LLM training efficiency, paving the way for more effective scaling strategies in language model development.
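The idea of an entropy-based, training-free proxy can be illustrated with a small sketch. The abstract does not say which entropy measure is used or how it maps to proportions, so the choices here (unigram Shannon entropy per domain, weights proportional to entropy) and the names `shannon_entropy` and `entropy_mixture` are assumptions made only for illustration.

```python
import math
from collections import Counter


def shannon_entropy(tokens):
    """Unigram Shannon entropy (in bits) of a token sequence."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_mixture(domains):
    """Map {domain name: token list} to proportions proportional to entropy.

    Assumed scheme: higher-entropy domains receive a larger share.
    """
    entropies = {name: shannon_entropy(toks) for name, toks in domains.items()}
    normalizer = sum(entropies.values())
    if normalizer == 0:
        raise ValueError("all domains have zero entropy; cannot normalize")
    return {name: h / normalizer for name, h in entropies.items()}


# Toy corpora: "web" is uniform over 4 tokens (2 bits),
# "code" is uniform over 2 tokens (1 bit).
toy_domains = {
    "web": "a b c d a b c d".split(),
    "code": "x x y y x x y y".split(),
}
weights = entropy_mixture(toy_domains)
```

On this toy input the 2-bit domain receives two thirds of the mixture and the 1-bit domain one third. No model is trained at any point, which is what makes a proxy of this kind computationally lightweight.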

Paper Structure

This paper contains 25 sections, 16 equations, 8 figures, and 5 tables.

Figures (8)

  • Figure 1: Visualization of the fitting results for Eq. \ref{eq:bimix_single} at different domain proportion values, showing the relationship between validation loss and training steps. Each subplot corresponds to a specific domain within different datasets; the points represent the actual observed validation loss, while the dotted lines indicate the fitted results. Both axes are on a logarithmic scale.
  • Figure 2: Visualization of the fitting results for Eq. \ref{eq:bimix_single} at different numbers of training steps, showing the relationship between validation loss and domain proportion. Each subplot corresponds to a specific domain within different datasets; the points represent the actual observed validation loss, while the dotted lines indicate the fitted results. Both axes are on a logarithmic scale.
  • Figure 3: Correlation between the observed validation losses (x-axis) and the BiMix-predicted losses (y-axis) across training iterations with the Baseline and DoReMi mixtures.
  • Figure 4: Comparison of average downstream accuracy of 1B models trained on different data mixtures. Details regarding specific tasks and the Exact Match metric can be found in Section \ref{sec:setup}.
  • Figure 5: Comparison of log-perplexity evaluations for models trained on different data mixtures.
  • ...and 3 more figures