Checkpoint Merging via Bayesian Optimization in LLM Pretraining
Deyuan Liu, Zecheng Wang, Bingning Wang, Weipeng Chen, Chunshan Li, Zhiying Tu, Dianhui Chu, Bo Li, Dianbo Sui
TL;DR
This work tackles the prohibitive cost of pretraining large language models by proposing checkpoint merging during pretraining, using pairwise linear combinations of adjacent checkpoints. A Bayesian optimization framework with Gaussian processes models the merging-performance objective $f(\lambda_t)$ and efficiently locates good merging weights via acquisition functions such as EI, PI, and UCB, combined with GP-Hedge for dynamic strategy selection. The approach is supported by PAC-Bayesian generalization bounds, which suggest that merging acts as a regularizer and can give merged checkpoints tighter generalization guarantees. Empirically, the method yields consistent gains on Baichuan2, DeepSeek, and Pythia models across multiple benchmarks (C-Eval, CMMLU, MMLU, GSM8K) and model sizes, while maintaining cross-domain robustness and improving efficiency over traditional merging baselines. Overall, checkpoint merging with Bayesian optimization offers a practical, architecture-agnostic path to substantial pretraining improvements with limited additional computation.
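To make the pairwise linear combination concrete, here is a minimal sketch rather than the authors' implementation. It assumes each checkpoint is a mapping from parameter names to flattened weight lists, that the merge takes the form $\lambda_t \Theta_t + (1 - \lambda_t)\Theta_{t-1}$, and that $\lambda_t$ lies in $[0, 1]$. The name `merge_checkpoints` and the `Checkpoint` type are hypothetical.

```python
from typing import Dict, List

# Hypothetical representation: parameter name -> flattened list of weights.
Checkpoint = Dict[str, List[float]]


def merge_checkpoints(prev: Checkpoint, curr: Checkpoint, lam: float) -> Checkpoint:
    """Return lam * curr + (1 - lam) * prev, parameter by parameter.

    Assumption: lam in [0, 1] weights the later checkpoint `curr`.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if prev.keys() != curr.keys():
        raise ValueError("checkpoints must share the same parameter names")

    merged: Checkpoint = {}
    for name in curr:
        if len(prev[name]) != len(curr[name]):
            raise ValueError(f"size mismatch for parameter {name!r}")
        # Element-wise convex combination of the two adjacent checkpoints.
        merged[name] = [
            lam * c + (1.0 - lam) * p for p, c in zip(prev[name], curr[name])
        ]
    return merged
```

With `lam = 1.0` the result is the later checkpoint unchanged, and with `lam = 0.0` it is the earlier one, so the search over the merging weight always includes both unmerged checkpoints as candidates.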
Abstract
The rapid proliferation of large language models (LLMs) such as GPT-4 and Gemini underscores the intense resource demands of their training, which poses significant challenges due to substantial computational and environmental costs. To alleviate this issue, we propose checkpoint merging during LLM pretraining. The method merges LLM checkpoints that share a training trajectory and uses Bayesian optimization to search an extensive space for the best merging weight. Through various experiments, we demonstrate that: (1) the proposed method can improve pretraining, delivering substantial benefits at minimal cost; (2) although it requires a held-out dataset, the proposed method still generalizes robustly across diverse domains, a pivotal property in pretraining.
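The search for a merging weight can be illustrated with a small Bayesian optimization loop. The sketch below is a simplified stand-in, not the paper's procedure: it uses a one-dimensional Gaussian process with a fixed RBF kernel and the expected improvement (EI) acquisition only, without GP-Hedge. The objective `toy_objective` is a hypothetical placeholder for the held-out score of a merged checkpoint, and all function names and hyperparameters (length scale, jitter, grid size) are assumptions chosen for illustration.

```python
import math
from typing import Callable, List, Tuple


def rbf(a: float, b: float, length: float = 0.2) -> float:
    """RBF kernel with unit prior variance (assumed hyperparameters)."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(A: List[List[float]], b: List[float]) -> List[float]:
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def gp_posterior(xs: List[float], ys: List[float], x_new: float,
                 jitter: float = 1e-4) -> Tuple[float, float]:
    """Posterior mean and standard deviation of the GP at x_new."""
    K = [[rbf(a, b) + (jitter if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    mean_y = sum(ys) / len(ys)  # constant mean estimated from the data
    alpha = solve(K, [y - mean_y for y in ys])
    v = solve(K, k_star)
    mu = mean_y + sum(k * a for k, a in zip(k_star, alpha))
    var = rbf(x_new, x_new) - sum(k * vi for k, vi in zip(k_star, v))
    return mu, math.sqrt(max(var, 1e-12))  # clamp tiny negative variances


def expected_improvement(mu: float, sigma: float, best: float) -> float:
    """EI for maximization: E[max(f - best, 0)] under N(mu, sigma^2)."""
    z = (mu - best) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (mu - best) * cdf + sigma * pdf


def search_merge_weight(f: Callable[[float], float], n_iter: int = 8,
                        grid_size: int = 101) -> Tuple[float, float]:
    """Return the best (weight, score) found by GP + EI on a grid over [0, 1]."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    xs = [0.0, 0.5, 1.0]  # initial design: both endpoints and the midpoint
    ys = [f(x) for x in xs]
    for _ in range(n_iter):
        best = max(ys)
        candidates = [g for g in grid if g not in xs]
        x_next = max(
            candidates,
            key=lambda g: expected_improvement(*gp_posterior(xs, ys, g), best),
        )
        xs.append(x_next)
        ys.append(f(x_next))
    i_best = max(range(len(ys)), key=ys.__getitem__)
    return xs[i_best], ys[i_best]


def toy_objective(lam: float) -> float:
    """Hypothetical stand-in for held-out accuracy of the merged checkpoint."""
    return -(lam - 0.7) ** 2


if __name__ == "__main__":
    lam, score = search_merge_weight(toy_objective)
    print(f"best weight: {lam:.2f}, score: {score:.4f}")
```

In practice each call to the objective would merge two checkpoints and evaluate the result on the held-out set, which is the expensive step. That cost is the reason for a sample-efficient search rather than a dense grid sweep over the merging weight.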
