Capacity-Aware Mixture Law Enables Efficient LLM Data Optimization

Jingwei Li; Xinran Gu; Jingzhao Zhang

TL;DR

This paper introduces a compute-efficient pipeline for data mixture scaling. It combines a capacity-aware mixture law, which models validation loss through the nonlinear interplay between model size and mixture, with a loss-to-benchmark prediction law that estimates benchmark accuracy from validation loss, enabling end-to-end performance prediction for the target model.

Abstract

A data mixture refers to how different data sources are combined to train large language models, and selecting an effective mixture is crucial for optimal downstream performance. Existing methods either conduct costly searches directly on the target model or rely on mixture scaling laws that fail to extrapolate well to large model sizes. We address these limitations by introducing a compute-efficient pipeline for data mixture scaling. First, we propose CAMEL, a capacity-aware mixture law that models validation loss with the nonlinear interplay between model size and mixture. We also introduce a loss-to-benchmark prediction law that estimates benchmark accuracy from validation loss, enabling end-to-end performance prediction for the target model. Next, we study how to allocate a fixed compute budget across model scales to fit the law and reduce prediction error. Finally, we apply our method to Mixture-of-Experts models with up to 7B-A150M parameters to fit the law, and verify the optimal mixture derived from the law by extrapolating to a 55B-A1.2B target model. Compared to prior methods, our method reduces mixture optimization costs by 50% and improves downstream benchmark performance by up to 3%.

Paper Structure (40 sections, 2 theorems, 43 equations, 11 figures, 3 tables)

Introduction
Data Mixture Scaling Laws
Capacity-Aware Mixture Scaling Laws
Intrinsic domains and mixture-induced domain weights.
Domain-wise loss model.
Capacity allocation objective.
Modeling Downstream Performance
Sampling Strategies
Evaluation on Larger Models
Setup
Model Architecture and Training.
Training Dataset and Mixtures.
Benchmarks.
Experimental Design
Results
...and 25 more sections

Key Result

Theorem 2.1

Assume Assumption \ref{a:a_similar} holds. At the optimal allocation ${\bm{\tilde{m}}}^*$ of Eq. \ref{eq:opt-train-prob}, the validation loss can be written in a form in which $C$ captures higher-order terms under Assumption \ref{a:a_similar}, and $\alpha_i$, $\beta_i$, and $K_i$ are functions of $\{(A_i,a_i,w_i)\}_{i=1}^k$ in Eq. \ref{eq:val_loss_model}.
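To make the idea of an optimal capacity allocation concrete, the following is a minimal sketch under assumptions that are not taken from the paper: each domain loss is assumed to follow a power law $A_i m_i^{-a_i}$ in the capacity $m_i$ given to that domain, the objective is assumed to be the $w_i$-weighted sum of these losses, and the capacities are assumed to sum to a total budget $M$. The function names `allocate_capacity` and `weighted_loss` are hypothetical. The paper's actual objective and loss model may differ.

```python
import math

def allocate_capacity(A, a, w, M, iters=200):
    """Split a total capacity M across domains.

    ASSUMPTION (illustrative, not the paper's stated model): minimize
    sum_i w_i * A_i * m_i**(-a_i) subject to sum_i m_i = M and m_i > 0.
    Stationarity gives w_i * A_i * a_i * m_i**(-(a_i + 1)) = lam for every
    domain, so each m_i is a closed-form function of the multiplier lam.
    """
    def alloc(lam):
        return [(wi * Ai * ai / lam) ** (1.0 / (ai + 1.0))
                for Ai, ai, wi in zip(A, a, w)]

    lo, hi = -60.0, 60.0  # search interval for log(lam)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(alloc(math.exp(mid))) > M:
            lo = mid  # allocation exceeds the budget: increase lam
        else:
            hi = mid  # allocation fits the budget: decrease lam
    return alloc(math.exp(0.5 * (lo + hi)))

def weighted_loss(A, a, w, m):
    """Weighted sum of the assumed per-domain power-law losses."""
    return sum(wi * Ai * mi ** (-ai) for Ai, ai, wi, mi in zip(A, a, w, m))

if __name__ == "__main__":
    # Hypothetical coefficients for two domains.
    A, a, w = [2.0, 3.0], [0.3, 0.5], [0.5, 0.5]
    for M in (1e6, 1e8):
        m = allocate_capacity(A, a, w, M)
        print(M, [mi / M for mi in m])
```

Under these assumed power laws, the share of capacity given to each domain changes with the total budget whenever the exponents $a_i$ differ, which is one simple way a non-proportional reallocation across model sizes can arise.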

Figures (11)

Figure 1: Mixture optimization on the target model under different compute budgets. We evaluate different mixture extrapolation methods by applying them to a larger target model with varying optimization costs. CAMEL, our proposed method, identifies high-quality data mixtures at less than the cost of one full training pass on the target model. As the optimization budget increases, CAMEL achieves higher average benchmark accuracy than baseline methods while using less than 50% of the compute cost of the baseline.
Figure 2: End-to-end framework for data mixture extrapolation under model scaling. We first fit a loss-to-benchmark mapping to relate validation loss to downstream benchmark accuracy (Section \ref{sec:benchmark}). We then model validation loss as a function of model size and data mixtures using sampled $(M, r)$ pairs from smaller models (Section \ref{sec:model_size}). These components together enable extrapolation to large models and direct optimization of data mixtures for target-scale performance (Section \ref{sec:extrapolate}).
Figure 3: Training loss observations for each domain across model sizes. We train on a mixed dataset of math and knowledge and log the training loss for each domain. While larger models reduce loss in both areas, the rates of reduction differ significantly. This non-uniform scaling implies that the effective parameters allocated to each domain are redistributed dynamically rather than proportionally as the model scales.
Figure 4: Comparison between CAMEL and baseline scaling laws. We compare the fitting error of our proposed Capacity-Aware Mixture Law (CAMEL) with two baseline methods, DML \cite{ye2025data} and SODM \cite{shukor2025scalinglawsoptimaldata}. CAMEL achieves consistently lower fitting error and exhibits more stable extrapolation behavior across model scales.
Figure 5: Error of loss-to-benchmark prediction. We model each downstream benchmark accuracy as a function of multiple validation losses. The scatter plots show predicted versus ground-truth scores on training and validation splits. The low prediction error demonstrates that validation losses can reliably predict downstream benchmark accuracy. See Appendix \ref{app:benchmark} for details and results on other benchmarks.
...and 6 more figures
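
The loss-to-benchmark mapping described in the Figure 5 caption can be illustrated with a small sketch. The functional form here is an assumption, not the paper's: accuracy is modeled as a logistic function of a linear combination of several validation losses, and the coefficients are fit by least squares on the logit of the observed accuracy. The names `fit_loss_to_benchmark` and `predict_accuracy` are hypothetical, and chance-level floors for multiple-choice benchmarks are not modeled.

```python
import math

def solve_linear(M, y):
    """Solve M x = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    aug = [row[:] + [y[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def fit_loss_to_benchmark(losses, accs):
    """Fit logit(acc) = b0 + sum_j b_j * loss_j by ordinary least squares.

    ASSUMPTION: the logistic link is an illustrative choice.
    losses: one row of validation losses per trained model.
    accs: benchmark accuracies strictly between 0 and 1.
    """
    X = [[1.0] + list(row) for row in losses]
    t = [math.log(p / (1.0 - p)) for p in accs]
    k = len(X[0])
    # Normal equations: (X^T X) b = X^T t
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xtt = [sum(r[i] * ti for r, ti in zip(X, t)) for i in range(k)]
    return solve_linear(XtX, Xtt)

def predict_accuracy(coef, loss_row):
    """Map a row of validation losses to a predicted accuracy in (0, 1)."""
    z = coef[0] + sum(b * l for b, l in zip(coef[1:], loss_row))
    return 1.0 / (1.0 + math.exp(-z))

if __name__ == "__main__":
    # Synthetic data generated from made-up coefficients, for illustration only.
    true_coef = [4.0, -1.0, -0.8]
    losses = [[2.0, 3.0], [1.8, 2.9], [1.5, 2.5], [1.9, 2.2], [1.2, 2.0]]
    accs = [predict_accuracy(true_coef, row) for row in losses]
    coef = fit_loss_to_benchmark(losses, accs)
    print(coef)
```

A fitted mapping of this kind would be chained with a validation-loss predictor, so that a candidate mixture is scored by predicted benchmark accuracy rather than by loss alone.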

Theorems & Definitions (3)

Theorem 2.1
Theorem 4.1
proof
