The Chicken and Egg Dilemma: Co-optimizing Data and Model Configurations for LLMs

Zhiliang Chen; Alfred Wei Lun Leong; Shao Yong Ong; Apivich Hemachandram; Gregory Kang Ruey Lau; Chuan-Sheng Foo; Zhengyuan Liu; Nancy F. Chen; Bryan Kian Hsiang Low

TL;DR

This work tackles the interdependent problem of jointly optimizing data mixtures and model configurations for LLMs within a fixed training budget. It introduces JoBS, which pairs a scaling-law-inspired performance predictor with Bayesian optimization to amortize the cost of full-training runs and enable many more optimization iterations. The authors derive a theoretical regret bound that captures the predictor error and budget tradeoffs, and empirically show JoBS outperforms independent data/model optimization and multi-fidelity BO baselines across multiple tasks and model families. The approach demonstrates practical gains by exploiting interactions between data and architecture and offers a principled budget-allocation strategy with broad applicability to LLM fine-tuning. Overall, JoBS provides a scalable framework for jointly optimizing data and model components, with solid theoretical backing and demonstrated empirical advantages.

Abstract

Co-optimizing data and model configurations for training LLMs presents a classic chicken-and-egg dilemma: The best training data configuration (e.g., data mixture) for a downstream task depends on the chosen model configuration (e.g., model architecture), and vice versa. However, jointly optimizing both data and model configurations is often deemed intractable, and existing methods focus on either data or model optimization without considering their interaction. We introduce JoBS, an approach that uses a scaling-law-inspired performance predictor to aid Bayesian optimization (BO) in jointly optimizing LLM training data and model configurations efficiently. JoBS allocates a portion of the optimization budget to learn an LLM performance predictor that predicts how promising a training configuration is from a small number of training steps. The remaining budget is used to perform BO entirely with the predictor, effectively amortizing the cost of running full-training runs. We study JoBS's average regret and devise the optimal budget allocation to minimize regret. JoBS outperforms existing multi-fidelity BO baselines, as well as data and model optimization approaches across diverse LLM tasks under the same optimization budget.
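The budget split described in the abstract can be illustrated with simple arithmetic. The sketch below is not the authors' implementation. It assumes the budget is counted in training steps, that each predictor-training run is a full-training run, and that each later BO iteration costs only a short run. The function names and the example values for run lengths and number of runs are hypothetical; only the total of 50000 steps comes from the Figure 5 caption.

```python
def split_budget(total_steps, n_full_runs, full_run_steps, short_run_steps):
    """Return (steps spent learning the predictor, affordable BO iterations).

    Assumption: the predictor is learned from `n_full_runs` full-training runs,
    and every subsequent BO iteration trains for only `short_run_steps`.
    """
    predictor_cost = n_full_runs * full_run_steps
    if predictor_cost > total_steps:
        raise ValueError("predictor training alone exceeds the total budget")
    remaining = total_steps - predictor_cost
    bo_iterations = remaining // short_run_steps
    return predictor_cost, bo_iterations


def full_run_only_iterations(total_steps, full_run_steps):
    """BO iterations affordable when every iteration is a full-training run."""
    return total_steps // full_run_steps


if __name__ == "__main__":
    # Hypothetical run lengths; only the 50000-step total is from the source.
    cost, iters = split_budget(50000, n_full_runs=20,
                               full_run_steps=1000, short_run_steps=100)
    print(cost, iters)                             # 20000 300
    print(full_run_only_iterations(50000, 1000))   # 50
```

Under these assumed numbers, spending part of the budget on the predictor leaves room for several times more BO iterations than running every iteration to completion. A larger number of initial runs should reduce predictor error but leaves fewer iterations, which is the tradeoff the regret analysis addresses.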

Paper Structure (27 sections, 1 theorem, 13 equations, 8 figures, 11 tables)

Introduction
Related Work
Problem Setup
Preliminary Findings
Performance Landscape
Amortization by Predicting Performance
Introducing JoBS
BO as the Backbone of JoBS
Amortization with Performance Predictor
Prediction Error and Performance Tradeoffs
Theoretical Analysis of JoBS
Experiments
Baselines
Comparison with Data and Model Optimization Approaches
Comparison with Multi-Fidelity BO (MF-BO) Approaches
...and 12 more sections

Key Result

Lemma 3.1

Let $\|f\|_{\kappa}=\sqrt{\langle f,f \rangle_\kappa} \leq \mathcal{B}$. Also, assume that the observation noise associated with each BO iteration is $R$-sub-Gaussian with $R>0$. Then with probability at least $1-\delta$, the following holds for BO iteration $t \leq T$: where $\gamma_{t}$ is the maximum information gain after $t$ observations and $\mu_t(x), \sigma_t^2(x)$ are mean and variance o

Figures (8)

Figure 1: JoBS optimally balances budget between learning a performance predictor and running Bayesian Optimization iterations.
Figure 2: LLM performance varies with different configurations.
Figure 3: A neural network can be trained to predict LLM performance from a small number of training steps.
Figure 4: Prediction error of the predictor in JoBS for different numbers of initial full-training runs $N$.
Figure 5: LLM performance of the best-found configuration at each iteration of JoBS, compared with other BO-centric approaches across different language tasks under the same total optimization budget of $50000$ training steps. JoBS's plot begins later because it allocates an initial budget to learn a performance predictor.
...and 3 more figures

Theorems & Definitions (2)

proof
Lemma 3.1
