Aioli: A Unified Optimization Framework for Language Model Data Mixing

Mayee F. Chen, Michael Y. Hu, Nicholas Lourie, Kyunghyun Cho, Christopher Ré

TL;DR

This work introduces Linear Mixing Optimization (LMO), a unifying framework that casts language-model data mixing as minimizing average loss subject to a mixing law, i.e., an assumed relationship between data proportions and loss. It shows that existing offline and online methods correspond to specific parameterizations of the mixing law and specific solving strategies, and that inaccurately set mixing-law parameters drive their inconsistent performance. The authors then present Aioli, an online method that estimates the mixing-law parameters ($A^t$) throughout training and updates data proportions accordingly, achieving consistent improvements over the stratified sampling baseline across multiple datasets and budget settings. In restricted-budget scenarios, where proportions are learned on shorter runs, Aioli dynamically adjusts those proportions over the full training run and further improves performance, underscoring the value of a mixing law that is faithful to the true loss-proportion relationship. The results suggest a practical path toward more reliable, efficient data mixing for large-scale LM training.

Abstract

Language model performance depends on identifying the optimal mixture of data groups to train on (e.g., law, code, math). Prior work has proposed a diverse set of methods to efficiently learn mixture proportions, ranging from fitting regression models over training runs to dynamically updating proportions throughout training. Surprisingly, we find that no existing method consistently outperforms a simple stratified sampling baseline in terms of average test perplexity. To understand this inconsistency, we unify existing methods into a standard framework, showing they are equivalent to solving a common optimization problem: minimize average loss subject to a method-specific mixing law -- an implicit assumption on the relationship between loss and mixture proportions. This framework suggests that measuring the fidelity of a method's mixing law can offer insights into its performance. Empirically, we find that existing methods set their mixing law parameters inaccurately, resulting in the inconsistent mixing performance we observe. Using this insight, we derive a new online method named Aioli, which directly estimates the mixing law parameters throughout training and uses them to dynamically adjust proportions. Aioli outperforms stratified sampling on 6 out of 6 datasets by an average of 0.27 test perplexity points, whereas existing methods fail to consistently beat stratified sampling, doing up to 6.9 points worse. Moreover, in a practical setting where proportions are learned on shorter runs due to computational constraints, Aioli can dynamically adjust these proportions over the full training run, consistently improving performance over existing methods by up to 12.012 test perplexity points.
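
The optimization problem described in the abstract can be sketched as follows. This is a reading of the abstract together with Lemma 1 and the Figure 2 caption, not the paper's exact statement; the symbols $f_i$ and $\theta$ are placeholders introduced here for illustration.

$$
\min_{\bm{p} \in \triangle^m} \; \frac{1}{m} \sum_{i=1}^m L_{\text{val}, i}(\bm{p}) \quad \text{subject to} \quad L_{\text{val}, i}(\bm{p}) = f_i(\bm{p}; \theta) \;\; \forall i \in [m]
$$

Here $\bm{p}$ holds the mixture proportions over $m$ data groups and $f_i$ is the method-specific mixing law with parameters $\theta$. Lemma 1 uses a linear dynamic form, $f_i = c_i^t - b_i^t \sum_{j=1}^m A_{ij}^t p_j^t$, while the static log-linear law in Figure 2 instead makes $\log(L_{\text{val}, i}(\bm{p}) - c_i)$ linear in $\bm{p}$.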

Paper Structure

This paper contains 66 sections, 8 theorems, 16 equations, 11 figures, 26 tables, 4 algorithms.

Key Result

Lemma 1

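The exponentiated gradient descent (EGD) update rule for the LMO optimization problem (minimizing average validation loss) subject to $L_{\text{val}, i}^{t+1}(\bm{p}) = c_i^t - b_i^t \sum_{j = 1}^m A_{ij}^t p_j^t \;\forall i\in[m]$ is $p_j^{t+1} = \frac{1}{Z^t}\, p_j^t \exp\big(\eta \sum_{i=1}^m b_i^t A_{ij}^t\big) \;\forall j\in[m]$, where $\eta > 0$ is the step size and $Z^t$ is a normalizing constant such that $\bm{p}^{t+1} \in \triangle^m$.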
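
To make the lemma concrete, here is a minimal sketch of one exponentiated gradient step under the linear mixing law stated in it. It assumes the objective is the sum of per-group validation losses, so the gradient with respect to $p_j$ is $-\sum_i b_i^t A_{ij}^t$ and each proportion is multiplied by $\exp(\eta \sum_i b_i^t A_{ij}^t)$ before renormalizing. The function name `egd_step` and all numeric values are hypothetical, chosen only for illustration; this is not the authors' implementation.

```python
import numpy as np

def egd_step(p, A, b, eta):
    """One exponentiated-gradient step on the probability simplex.

    Assumes the linear law L_i = c_i - b_i * sum_j A[i, j] * p[j], so the
    gradient of sum_i L_i with respect to p[j] is -sum_i b[i] * A[i, j].
    """
    grad = -(b[:, None] * A).sum(axis=0)  # shape (m,), one entry per group j
    logits = np.log(p) - eta * grad       # multiplicative update in log space
    logits -= logits.max()                # stabilize the exponentials
    w = np.exp(logits)
    return w / w.sum()                    # the division plays the role of Z^t

# Hypothetical values for m = 2 data groups (illustration only).
p0 = np.array([0.5, 0.5])
A = np.array([[1.0, 0.2],
              [0.1, 0.5]])
b = np.ones(2)
p1 = egd_step(p0, A, b, eta=1.0)
```

In this toy setup the first column of `A` has the larger weighted sum, meaning group 1 reduces total loss more per unit of proportion, so the step shifts mass toward group 1 while keeping the proportions on the simplex.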

Figures (11)

  • Figure 1: Left: existing methods can be expressed in a unified optimization framework, in which they implicitly assume a linear or log-linear loss-proportion relationship. Center: the (log)-linear parameterizations are well-specified, but existing methods set their parameters incorrectly. Right: Aioli, an online mixing method that more accurately estimates the parameters that capture the true loss-proportion relationship.
  • Figure 2: Left: $p_i$ vs $\log(L_{\text{val}, i}(\bm{p}) - c_i)$ with fitted static log-linear mixing law. Right: $p_i^t$ vs $L_{\text{val}, i}(\bm{p})$ with fitted linear dynamic mixing law. Colors represent random seeds (left) and initial $p^0 \in \mathcal{P}$ (right, blue is ${0.7, 0.3}$). Both laws fit the true loss-proportion relationship well.
  • Figure 3: Improvement over stratified sampling versus optimality of $A^t$. Each dot represents a method applied to a dataset. The red region shows that existing methods are worse than stratified on at least 1 dataset. The vertical dashed line serves as a visual aid.
  • Figure 4: Top: Log-linear static mixing law fit on Books/C4 across 5 random seeds. Bottom: Linear dynamic mixing law fit on Books/C4 on 1 random seed. Each color is a different initial mixture $p^0 \in \mathcal{P}$ trained for $2000$ steps, and the fitting sweeps are done over $100$ additional steps.
  • Figure 5: Residual plots to check for interactions in the dynamic mixing law experiments with 3 domains (Arxiv, Books, and StackExchange). The target loss is Arxiv. Columns correspond to different initial mixing proportions. Data points show the (externally studentized) residuals of different mixing proportions after fitting the linear mixing law. Top row: each point in the simplex corresponds to a different mixture of the 3 domains, with its color giving the residual's value at that point (red is positive, blue is negative). Bottom 3 rows: each row shows the residual plotted against a different interaction term: $P_1 P_2$, $P_1 P_3$, and $P_2 P_3$. Dotted gray lines show upper and lower 99% confidence limits for the residuals, assuming the linear regression assumptions hold.
  • ...and 6 more figures

Theorems & Definitions (15)

  • Definition 1
  • Lemma 1
  • Theorem 1
  • Lemma 1
  • proof
  • Proposition 1: Skill-It Derivation
  • proof
  • Proposition 2: DoReMi Derivation
  • proof
  • Proposition 3: DoGE Derivation
  • ...and 5 more