Recycling Pretrained Checkpoints: Orthogonal Growth of Mixture-of-Experts for Efficient Large Language Model Pre-Training

Ruizhe Wang, Yucheng Ding, Xiao Liu, Yaoxiang Wang, Peng Cheng, Baining Guo, Zhengjun Zha, Yeyun Gong

TL;DR

The paper tackles the high cost of pretraining large language models by treating converged pretrained checkpoints as valuable sunk cost and proposing two orthogonal Mixture-of-Experts growth strategies: interpositional depth growth and noise-injected width growth. It demonstrates that deeper, converged models benefit from an interpositional expansion approach that preserves learned layer structure, while width growth with Gaussian noise promotes expert specialization and stability across pre-norm and post-norm transformers. A comprehensive analysis shows a strong positive correlation between prior investment (sunk FLOPs) and final performance. The approach also scales: growing a 17B MoE to 70B with 1T tokens yields a 10.66% average accuracy gain over training from scratch under the same additional compute. Under a fixed total FLOPs budget, growth is often competitive with or superior to training from scratch, supporting economically efficient LLM pretraining by reusing existing checkpoints. The work provides a foundation for sustainable scaling of LLMs by reusing prior computation, with practical guidelines on growth timing and methodology.

Abstract

The rapidly increasing computational cost of pretraining Large Language Models necessitates more efficient approaches. Substantial compute has already been invested in existing well-trained checkpoints, but many of them remain underutilized due to engineering constraints or limited model capacity. To efficiently reuse this "sunk" cost, we propose to recycle pretrained checkpoints by expanding their parameter counts and continuing training. We propose orthogonal growth methods well-suited to converged Mixture-of-Experts models: interpositional layer copying for depth growth and expert duplication with injected noise for width growth. To determine the optimal timing for such growth across checkpoint sequences, we perform comprehensive scaling experiments, which reveal that final accuracy has a strong positive correlation with the amount of sunk cost, indicating that greater prior investment leads to better performance. We scale our approach to models with 70B parameters and over 1T training tokens, achieving a 10.66% accuracy gain over training from scratch under the same additional compute budget. Our checkpoint recycling approach establishes a foundation for economically efficient large language model pretraining.
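
The two growth operations named in the abstract can be illustrated with a toy sketch. It assumes that interpositional copying places each duplicated layer directly after its original (in contrast to stacking the copied block on top of the model), and that width growth appends a noisy copy of each expert's weights. The function names, the list-of-floats expert representation, and the `noise_std` parameter are hypothetical and not taken from the paper; router expansion and real tensor handling are omitted.

```python
import copy
import random


def grow_depth_interposition(layers):
    """Assumed interpositional growth: [l0, l1, ...] -> [l0, l0', l1, l1', ...]."""
    grown = []
    for layer in layers:
        grown.append(layer)
        grown.append(copy.deepcopy(layer))  # copy sits right after its original
    return grown


def grow_depth_stack(layers):
    """Stack baseline for comparison: [l0, l1, ...] -> [l0, l1, ..., l0', l1', ...]."""
    return list(layers) + [copy.deepcopy(layer) for layer in layers]


def grow_width_with_noise(experts, noise_std, seed=0):
    """Double the expert count by appending Gaussian-perturbed copies.

    `experts` is a list of flat weight lists (a toy stand-in for expert
    parameters). The originals are kept unchanged; only the copies are noised.
    """
    rng = random.Random(seed)
    noisy_copies = [
        [w + rng.gauss(0.0, noise_std) for w in expert] for expert in experts
    ]
    return [list(expert) for expert in experts] + noisy_copies


if __name__ == "__main__":
    print(grow_depth_interposition(["A", "B", "C"]))
    print(grow_depth_stack(["A", "B", "C"]))
    print(len(grow_width_with_noise([[1.0, 2.0], [3.0, 4.0]], noise_std=0.01)))
```

In this sketch the noise breaks the symmetry between an expert and its copy, so the two can diverge during continued training; with `noise_std=0.0` the copies are exact duplicates.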

Paper Structure

This paper contains 22 sections, 6 equations, 12 figures, 26 tables.

Figures (12)

  • Figure 1: Main effect and method of our model growth framework
  • Figure 2: Characteristic layer-wise weight norm distribution in pre-trained LLMs, including pre-trained models from this work and from the open-source community.
  • Figure 3: Performance comparison of interposition and stack depth growth strategies. Left: training loss; Right: average downstream task accuracy.
  • Figure 4: The impact of noise injection scale on width growth performance. Left: training loss; Right: average downstream task accuracy.
  • Figure 5: Comparative analysis of performance and stability between depth and width growth.
  • ...and 7 more figures