Preparing Lessons for Progressive Training on Language Models

Yu Pan; Ye Yuan; Yichun Yin; Jiaxin Shi; Zenglin Xu; Ming Zhang; Lifeng Shang; Xin Jiang; Qun Liu

TL;DR

Transformer training incurs high computational and environmental costs, motivating a universal method for accelerating training from scratch. Apollo uses Low-Value-Prioritized Sampling (LVPS) so that shallow layers learn high-layer functionality in advance, shares weights to expand depth efficiently, and uses layer interpolation to keep training stable during expansion. On BERT and GPT, Apollo achieves substantial FLOPs savings and competitive downstream performance, rivaling methods that rely on pretrained models. By reducing training time and resource use, the method supports greener, more scalable training while remaining applicable to new model designs.

Abstract

The rapid progress of Transformers in artificial intelligence has come at the cost of increased resource consumption and greenhouse gas emissions due to growing model sizes. Prior work suggests using pretrained small models to improve training efficiency, but this approach may not be suitable for new model structures. On the other hand, training from scratch can be slow, and progressively stacking layers often fails to achieve significant acceleration. To address these challenges, we propose a novel method called Apollo, which prepAres lessons for exPanding Operations by Learning high-Layer functiOnality during training of low layers. Our approach involves low-value-prioritized sampling (LVPS) to train different depths and weight sharing to facilitate efficient expansion. We also introduce an interpolation method for stable model depth extension. Experiments demonstrate that Apollo achieves state-of-the-art acceleration ratios, even rivaling methods using pretrained models, making it a universal and efficient solution for training deep models while reducing time, financial, and environmental costs.

Paper Structure (19 sections, 11 equations, 9 figures, 5 tables, 1 algorithm)

Introduction
Related Work
Efficient training from scratch.
Efficient training by reusing pretrained models.
Method
Notations
Efficient Training by Apollo
Stack vs. Interpolation
Experiment
Common setting:
Experiment on Expanding Method
Experiment on Sampling Method
Experiment on BERT
Experiment on GPT
Conclusion
...and 4 more sections

Figures (9)

Figure 1: An illustration of Apollo training an $L$-layer model within $T$ steps. The training process is divided into $S$ stages. At the $t$-th step in the $s$-th stage, the model holds $N^{(s)}$ layers of weights (the left layers in each stage in the figure). To let these $N^{(s)}$ layers learn the functionality of higher layers in advance, we construct $L^{(t)}$ layers (the right layers in each stage in the figure) by sharing the $N^{(s)}$ weights through an interpolation method, where $N^{(s)} \leq L^{(t)}$. Layers with the same color share the same weight. At the $t$-th step, $L^{(t)}$ is chosen randomly according to a probability function, Low-Value-Prioritized Sampling (LVPS). Because LVPS tends to select smaller depths, it greatly reduces computation cost. The number of weight layers $N^{(s)}$ is increased progressively on entering each new stage. Because weights in the early stages can already learn the properties of higher layers, Apollo significantly improves training efficiency.
Figure 2: An example of choosing the hyperparameter $k$ of LVPS when sampling a layer number from 1 to 6.
Figure 3: Comparison of US, FS, ES, and LVPS when sampling a layer number from 1 to 6.
Figure 4: An example of expanding 3 layers to 6 layers. The same color denotes the same weight. The stacking method repeats the layers cyclically, e.g., the 1st layer $\rightarrow$ the 4th layer. By contrast, the interpolation method places the copy of each layer next to the original, e.g., the 1st layer $\rightarrow$ the 2nd layer.
Figure 5: Distribution of output activations. BERT-Base/2 is half of BERT-Base. BERT-Base/2-S and BERT-Base/2-I denote BERT-Base/2 expanded to BERT-Base by stacking and by interpolation, respectively. After stacking, the distribution of output activations changes considerably, while interpolation largely preserves it.
...and 4 more figures
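
The difference between stacking and interpolation described in the Figure 4 caption can be sketched as a mapping from each layer of the expanded model to the weight it reuses. This is only an illustrative sketch, not the authors' implementation: the function names `stack_indices` and `interpolate_indices` are hypothetical, indices are 0-based (the caption counts from 1), and the floor-based rule used when the target depth is not a multiple of the number of weights is an assumption that the caption does not cover.

```python
def stack_indices(n_weights: int, n_layers: int) -> list[int]:
    """Stacking: repeat the whole block of weights in order.

    Layer i of the expanded model reuses weight i % n_weights.
    """
    if not 0 < n_weights <= n_layers:
        raise ValueError("need 0 < n_weights <= n_layers")
    return [i % n_weights for i in range(n_layers)]


def interpolate_indices(n_weights: int, n_layers: int) -> list[int]:
    """Interpolation: copies of a weight sit next to each other.

    Layer i reuses weight floor(i * n_weights / n_layers). The floor rule
    for non-divisible depths is an assumption made for this sketch.
    """
    if not 0 < n_weights <= n_layers:
        raise ValueError("need 0 < n_weights <= n_layers")
    return [i * n_weights // n_layers for i in range(n_layers)]


if __name__ == "__main__":
    # Expanding 3 weights to 6 layers, as in Figure 4 (0-based indices).
    print(stack_indices(3, 6))        # [0, 1, 2, 0, 1, 2]
    print(interpolate_indices(3, 6))  # [0, 0, 1, 1, 2, 2]
```

With stacking, weight 0 reappears at position 3 (the 1st layer maps to the 4th); with interpolation, it reappears at position 1 (the 1st layer maps to the 2nd), so neighboring layers of the expanded model stay close to the ordering of the original ones.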
