Checkpoint Merging via Bayesian Optimization in LLM Pretraining

Deyuan Liu, Zecheng Wang, Bingning Wang, Weipeng Chen, Chunshan Li, Zhiying Tu, Dianhui Chu, Bo Li, Dianbo Sui

TL;DR

This work tackles the prohibitive costs of pretraining large language models by proposing checkpoint merging during pretraining, using pairwise linear combinations of adjacent checkpoints. A Bayesian optimization framework with Gaussian processes models the merging-performance objective $f(\lambda_t)$ and efficiently locates merging weights via acquisition functions such as EI, PI, and UCB, combined with GP-Hedge for dynamic strategy selection. The approach is supported by PAC-Bayesian generalization bounds, suggesting merged checkpoints can enjoy tighter generalization by effectively regularizing the model. Empirically, the method yields consistent gains across Baichuan2, DeepSeek and Pythia models, across multiple benchmarks (C-Eval, CMMLU, MMLU, GSM8K) and model sizes, while maintaining cross-domain robustness and demonstrating notable efficiency improvements over traditional merging baselines. Overall, checkpoint merging with Bayesian optimization offers a practical, architecture-agnostic path to substantial pretraining improvements with limited additional computation.
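
The acquisition functions named above have standard closed forms once a GP supplies a posterior mean and standard deviation at a candidate merging weight. The sketch below is illustrative only and is not the authors' implementation: the function names, the `kappa` and `eta` parameters, and the softmax-over-gains selection rule (one common formulation of GP-Hedge) are assumptions introduced here.

```python
import math
import random


def _pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def _cdf(z):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mean, std, best):
    """EI for maximization: expected gain over the best value seen so far."""
    z = (mean - best) / std
    return (mean - best) * _cdf(z) + std * _pdf(z)


def probability_of_improvement(mean, std, best):
    """PI for maximization: chance that the candidate beats the best value."""
    return _cdf((mean - best) / std)


def upper_confidence_bound(mean, std, kappa=2.0):
    """UCB: optimistic estimate; kappa (assumed value) sets exploration."""
    return mean + kappa * std


def hedge_probabilities(gains, eta=1.0):
    """Softmax over cumulative gains, one entry per acquisition function."""
    top = max(gains)  # subtract the max for numerical stability
    weights = [math.exp(eta * (g - top)) for g in gains]
    total = sum(weights)
    return [w / total for w in weights]


def hedge_choose(gains, rng, eta=1.0):
    """Sample the index of the acquisition function whose nominee is used."""
    probs = hedge_probabilities(gains, eta)
    return rng.choices(range(len(gains)), weights=probs, k=1)[0]


def update_gains(gains, nominee_means):
    """Reward each function with the posterior mean at its own nominee."""
    return [g + m for g, m in zip(gains, nominee_means)]
```

In a full loop, each of the three functions would nominate its own best candidate weight, `hedge_choose` would pick which nominee to evaluate, and `update_gains` would be called after the GP is refit.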

Abstract

The rapid proliferation of large language models (LLMs) such as GPT-4 and Gemini underscores the intense resource demands of training them, which carry substantial computational and environmental costs. To alleviate this issue, we propose checkpoint merging in LLM pretraining. The method uses LLM checkpoints that share a training trajectory and searches the space of merging weights for the best value via Bayesian optimization. Through various experiments, we demonstrate that: (1) the proposed method can improve pretraining, offering substantial benefits at minimal cost; (2) although it requires a held-out dataset, the proposed method still generalizes robustly across diverse domains, a pivotal property in pretraining.
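
To make the search for a merging weight concrete, here is a minimal pure-Python sketch of GP-based Bayesian optimization over a single scalar weight. Everything in it is an assumption made for illustration: the RBF kernel and its length scale, the UCB acquisition, the candidate grid, the search interval `[lower, upper]`, and above all `toy_accuracy`, a synthetic stand-in for evaluating a merged checkpoint on the held-out dataset. It does not reproduce the paper's setup or results.

```python
import math


def rbf(a, b, length=0.15):
    """Squared-exponential kernel on scalar merging weights."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def cholesky(A):
    """Lower-triangular Cholesky factor of a small positive definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L


def forward_sub(L, b):
    """Solve L y = b for lower-triangular L."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y


def backward_sub(L, y):
    """Solve L^T x = y for lower-triangular L."""
    n = len(y)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def gp_posterior(xs, ys, x_new, noise=1e-4):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    L = cholesky(K)
    alpha = backward_sub(L, forward_sub(L, ys))
    k_star = [rbf(x_new, a) for a in xs]
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = forward_sub(L, k_star)
    var = max(1.0 - sum(t * t for t in v), 1e-12)
    return mean, math.sqrt(var)


def optimize_merge_weight(objective, lower=0.5, upper=1.0, n_iter=8, kappa=2.0):
    """Maximize objective(weight) on a grid over [lower, upper] with GP-UCB."""
    grid = [lower + (upper - lower) * i / 50 for i in range(51)]
    xs = [grid[0], grid[25], grid[50]]  # endpoints and midpoint as initial design
    ys = [objective(x) for x in xs]
    for _ in range(n_iter):
        mu = sum(ys) / len(ys)
        scale = (max(ys) - min(ys)) or 1.0
        zs = [(y - mu) / scale for y in ys]  # normalize targets for the GP

        def ucb(x):
            m, s = gp_posterior(xs, zs, x)
            return m + kappa * s

        candidates = [x for x in grid if x not in xs]
        nxt = max(candidates, key=ucb)
        xs.append(nxt)
        ys.append(objective(nxt))
    best = max(range(len(xs)), key=lambda i: ys[i])
    return xs[best], ys[best]


def toy_accuracy(lam):
    """Synthetic objective peaking at 0.8; NOT a result from the paper."""
    return 0.55 - (lam - 0.8) ** 2


best_lam, best_acc = optimize_merge_weight(toy_accuracy)
```

In practice the objective would merge two checkpoints at the proposed weight and score the result on held-out data, so each call is far more expensive than the GP update; that cost asymmetry is what makes a sample-efficient search attractive.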

Paper Structure (70 sections, 2 theorems, 88 equations, 7 figures, 10 tables)

Key Result

Theorem 1

Under the assumptions of Lipschitz continuity and local convexity of $f(\lambda_t)$, and assuming bounded observation noise, the GP-based Bayesian optimization approach for checkpoint merging converges almost surely to a merging weight $\lambda_t^*$ that maximizes $f(\widetilde{\Theta}_t)$, such tha

Figures (7)

  • Figure 1: Overview of the Bayesian optimization framework for checkpoint merging in LLM pretraining. The framework operates by linearly combining intermediate checkpoints $\Theta_{t-1}$ and $\Theta_t$ with optimized merging weights $\lambda_t$. Through iterative Bayesian optimization, the method identifies performance "sweet spots" in the loss landscape that enhance model efficacy without much additional computational resources, effectively transforming intermediate checkpoints into improved models.
  • Figure 2: Performance landscape of pairwise checkpoint merging using the Greedy Soup method on the C-Eval benchmark across 11 Baichuan2 checkpoints spanning 200B to 2640B tokens. The heatmap reveals that merging adjacent checkpoints (near the diagonal) generally yields superior performance, while merging distant checkpoints results in significant performance degradation.
  • Figure 3: Impact of varying merging weights on model performance when combining two representative checkpoint pairs: Baichuan2-1540B with Baichuan2-1760B, and Baichuan2-2200B with Baichuan2-2420B. The graph illustrates accuracy on the C-Eval dataset as a function of uniformly sampled merging weights ranging from 0 to 1. The results demonstrate distinct patterns: for checkpoints with performance gaps, optimal weights favor the stronger model, while for similarly performing checkpoints, a broad range of weights (76%) can yield improvements, highlighting the complexity of the optimization landscape.
  • Figure 4: Effect of merging weight search space boundaries on the performance of merged Baichuan2-1760B and Baichuan2-1980B models evaluated on the C-Eval dataset. The figure illustrates how varying the lower bound parameter $\alpha$ (set to 0.5, 0.7, and 0.9) influences accuracy outcomes in our Bayesian optimization framework. Results show that moderate search space constraints ($\alpha = 0.5$ or $0.7$) yield optimal performance, while overly restrictive bounds ($\alpha = 0.9$) lead to performance degradation.
  • Figure 5: Empirical comparison of merging strategies across different numbers of adjacent checkpoints using the Greedy Soup method on the C-Eval dataset. The analysis compares performance when merging two, three, and four consecutive Baichuan2 checkpoints across various training stages (200B to 2640B tokens). Results demonstrate that pairwise merging consistently outperforms multi-checkpoint combinations, with diminishing returns as more checkpoints are included.
  • ...and 2 more figures
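
The pairwise merge in Figure 1 and the weight sweeps in Figures 3 and 4 can be sketched in a few lines. This is a hedged illustration, not the authors' code: checkpoints are represented as plain dictionaries of float lists, the helper names are hypothetical, and it is an assumption here that $\lambda_t$ multiplies the later checkpoint $\Theta_t$ while $1 - \lambda_t$ multiplies $\Theta_{t-1}$, with $\alpha$ acting as the lower bound of the searched interval $[\alpha, 1]$.

```python
def merge_checkpoints(prev_ckpt, curr_ckpt, lam):
    """Return lam * curr + (1 - lam) * prev, parameter by parameter."""
    if prev_ckpt.keys() != curr_ckpt.keys():
        raise ValueError("checkpoints must contain the same parameter names")
    return {
        name: [lam * c + (1.0 - lam) * p
               for p, c in zip(prev_ckpt[name], curr_ckpt[name])]
        for name in curr_ckpt
    }


def sweep_weights(alpha, steps):
    """Uniformly spaced merging weights on the interval [alpha, 1]."""
    return [alpha + (1.0 - alpha) * i / (steps - 1) for i in range(steps)]


# Toy two-parameter "checkpoints"; real ones would hold model tensors.
theta_prev = {"w": [0.0, 2.0]}
theta_curr = {"w": [1.0, 4.0]}

merged_by_weight = {
    lam: merge_checkpoints(theta_prev, theta_curr, lam)
    for lam in sweep_weights(alpha=0.5, steps=3)
}
```

Raising `alpha` toward 1 shrinks the interval that the sweep (or a Bayesian optimizer) can explore, which is the search-space restriction that Figure 4 examines.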

Theorems & Definitions (6)

  • proof
  • proof
  • Theorem 1: Convergence of GP-based Checkpoint Merging
  • proof
  • Theorem 2: PAC-Bayesian Generalization Bound
  • proof