Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training

Wenyu Du, Tongxu Luo, Zihan Qiu, Zeyu Huang, Yikang Shen, Reynold Cheng, Yike Guo, Jie Fu

TL;DR

This work summarizes existing approaches into four atomic growth operators and systematically evaluates them in a standardized LLM pre-training setting, revealing that a depthwise stacking operator, called $G_{\text{stack}}$, exhibits remarkable acceleration in training, leading to decreased loss and improved overall performance on eight standard NLP benchmarks compared to strong baselines.

Abstract

LLMs are computationally expensive to pre-train due to their large scale. Model growth emerges as a promising approach by leveraging smaller models to accelerate the training of larger ones. However, the viability of these model growth methods in efficient LLM pre-training remains underexplored. This work identifies three critical $\underline{\textit{O}}$bstacles: ($\textit{O}$1) lack of comprehensive evaluation, ($\textit{O}$2) untested viability for scaling, and ($\textit{O}$3) lack of empirical guidelines. To tackle $\textit{O}$1, we summarize existing approaches into four atomic growth operators and systematically evaluate them in a standardized LLM pre-training setting. Our findings reveal that a depthwise stacking operator, called $G_{\text{stack}}$, exhibits remarkable acceleration in training, leading to decreased loss and improved overall performance on eight standard NLP benchmarks compared to strong baselines. Motivated by these promising results, we conduct extensive experiments to delve deeper into $G_{\text{stack}}$ to address $\textit{O}$2 and $\textit{O}$3. For $\textit{O}$2 (untested scalability), our study shows that $G_{\text{stack}}$ is scalable and consistently performs well, with experiments up to 7B LLMs after growth and pre-training LLMs with 750B tokens. For example, compared to a conventionally trained 7B model using 300B tokens, our $G_{\text{stack}}$ model converges to the same loss with 194B tokens, resulting in a 54.6\% speedup. We further address $\textit{O}$3 (lack of empirical guidelines) by formalizing guidelines to determine growth timing and growth factor for $G_{\text{stack}}$, making it practical in general LLM pre-training. We also provide in-depth discussions and comprehensive ablation studies of $G_{\text{stack}}$. Our code and pre-trained model are available at https://llm-stacking.github.io.
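
The 54.6% figure quoted above is consistent with reading speedup as the ratio of the baseline token budget to the grown model's token budget, minus one. The following is a minimal sketch under that assumption; the function name `speedup` and the use of token counts as the cost measure are illustrative choices, not definitions taken from the abstract.

```python
def speedup(baseline_tokens: float, grown_tokens: float) -> float:
    """Relative speedup, assuming cost is proportional to tokens consumed.

    Assumed definition: baseline_tokens / grown_tokens - 1, i.e. how much
    extra budget the baseline needs to reach the same loss.
    """
    if grown_tokens <= 0:
        raise ValueError("grown_tokens must be positive")
    return baseline_tokens / grown_tokens - 1.0


# Numbers from the abstract: 300B tokens from scratch vs. 194B tokens with G_stack.
sp = speedup(300e9, 194e9)
print(f"{sp:.1%}")  # prints 54.6%
```

Under this reading, reaching the same loss with 194B tokens instead of 300B means the baseline needs about 1.546 times the budget of the grown model.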

Paper Structure (77 sections, 41 equations, 44 figures, 7 tables, 2 algorithms)

Figures (44)

  • Figure 1: The training loss for two 7B LLMs, trained from scratch and with $G_{\text{direct}}^{\uparrow}$ ($G_{\text{stack}}$). At 300B tokens, $G_{\text{stack}}$ accelerates by 54.6% compared to scratch.
  • Figure 2: A simplified illustration of the four growth operators $G_{\text{direct}}$, $G_{\text{learn}}$, $G_{\text{zero}}$ and $G_{\text{random}}$, each of which can grow a model widthwise (intra-layer), $G^{\to}$, or depthwise (layer-wise), $G^{\uparrow}$. $\mathbf{W_n}$ denotes the parameters before growth, while $\mathbf{D_n}$, $\mathbf{R_n}$ and $\mathbf{O}$ denote the growth parameters that are derived from the old parameters, randomly initialized, and zero-initialized, respectively. Except for $G_{\text{direct}}$, the operators are illustrated for widthwise growth only.
  • Figure 3: We evaluate operators using training loss and eight standard NLP benchmarks: Lambada [paperno2016lambada], ARC-c [clark2018think], ARC-e [clark2018think], Logiqa [liu2020logiqa], PIQA [bisk2019piqa], Sciq [welbl2017crowdsourcing], Winogrande [sakaguchi2019winogrande] and Wikitext PPL [merity2016pointer]. After $8 \times 10^{20}$ FLOPs of training, $G_{\text{direct}}^\uparrow$ demonstrates a significant speedup.
  • Figure 4: Training 3B LLMs with 300B tokens. $G_{\text{stack}}$ significantly outperforms scratch in (a) loss and (b) average accuracy across NLP benchmarks. At 180B and 240B tokens, $G_{\text{stack}}$ accelerates by 48.6% and 54.5% compared to scratch.
  • Figure 5: Training 7B LLMs with 300B tokens. $G_{\text{stack}}$ significantly outperforms scratch in (a) loss and (b) average accuracy across NLP benchmarks. At 160B, 220B and 280B tokens, $G_{\text{stack}}$ accelerates by 40.8%, 55.3% and 53.8% compared to scratch.
  • ...and 39 more figures
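
The depthwise operator $G_{\text{direct}}^{\uparrow}$ ($G_{\text{stack}}$) named in the captions above builds a deeper model from the parameters of a shallower one. The following is a minimal, framework-free sketch of that idea. It assumes the whole layer stack is repeated `growth_factor` times and that every copy is independent. Both the repetition order and the names `stack_layers` and `growth_factor` are hypothetical and are not taken from the text above.

```python
import copy
from typing import Any, List


def stack_layers(layers: List[Any], growth_factor: int) -> List[Any]:
    """Return a deeper layer list made of `growth_factor` copies of `layers`.

    Assumption: the small model's full stack is repeated end to end,
    giving [l_1..l_n, l_1..l_n, ...]. Each layer is deep-copied so the
    copies can diverge during continued training.
    """
    if growth_factor < 1:
        raise ValueError("growth_factor must be at least 1")
    return [copy.deepcopy(layer) for _ in range(growth_factor) for layer in layers]


# Toy "layers": plain dicts standing in for transformer blocks.
small_model = [
    {"name": "layer0", "weights": [1.0, 2.0]},
    {"name": "layer1", "weights": [3.0, 4.0]},
]
grown_model = stack_layers(small_model, growth_factor=2)
print([layer["name"] for layer in grown_model])
# prints ['layer0', 'layer1', 'layer0', 'layer1']
```

Deep-copying matters here: if the grown model shared objects with the small one, updating one copy of a layer would silently update the others.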