On the Effectiveness of Incremental Training of Large Language Models

Miles Q. Li, Benjamin C. M. Fung, Shih-Chia Huang

TL;DR

Problem: training transformer-based large language models (LLMs) efficiently at scale. The paper studies incremental layer-wise training, in which layers are added stage by stage, each stage has two training phases, and an optional continual training phase follows. It accompanies the method with a formal cost framework that includes expressions such as $C_{baseline} = 2TLc$, $C_{incremental} = \frac{T_{inc} c L (3S + 5)}{4S}$, and $T_{cont} = \frac{5}{8}\left(1-\frac{1}{S}\right)T$. Method: the approach is evaluated on a GPT-2–style model (12 layers, 124.4M parameters) trained on 10B tokens, using 4, 8, and 12 stages, and compared against end-to-end training under equal compute budgets. Key findings: incremental training reduces per-step cost early on but needs substantial continual training to reach baseline performance, so the total compute required for similar results is higher; it therefore offers no practical efficiency advantage. Significance: the results challenge incremental layer-wise training as a viable strategy for LLM training and point to the need for other, more effective efficiency methods in large-scale language modeling.
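
The three cost expressions quoted above can be checked numerically. The sketch below is illustrative only and is not taken from the paper's code. The summary does not define the symbols, so it assumes that $T$ is the number of baseline steps, $L$ the number of layers, $S$ the number of stages, $c$ a per-layer cost unit (set to 1 here), and $T_{inc}$ the number of incremental steps; the function names are hypothetical. Under the further assumptions that $T_{inc} = T$ and that one full-model step costs $2Lc$ (as implied by $C_{baseline} = 2TLc$), the quoted $T_{cont}$ works out algebraically to the number of continual steps that brings the total cost up to the baseline budget.

```python
import math


def baseline_cost(T, L, c=1.0):
    """C_baseline = 2 * T * L * c (symbol meanings are assumed, see lead-in)."""
    return 2 * T * L * c


def incremental_cost(T_inc, L, S, c=1.0):
    """C_incremental = T_inc * c * L * (3S + 5) / (4S)."""
    return T_inc * c * L * (3 * S + 5) / (4 * S)


def continual_steps(T, S):
    """T_cont = (5/8) * (1 - 1/S) * T."""
    return 5 / 8 * (1 - 1 / S) * T


# 12 layers and 10,000 baseline steps appear in the summary; c = 1 is an assumed unit.
T, L, c = 10_000, 12, 1.0

for S in (4, 8, 12):
    c_inc = incremental_cost(T, L, S, c)   # assumes T_inc = T
    t_cont = continual_steps(T, S)
    # Assumed cost of continual training: full-model steps at 2 * L * c each.
    total = c_inc + 2 * L * c * t_cont
    matches = math.isclose(total, baseline_cost(T, L, c))
    print(f"S={S}: C_inc={c_inc:.1f}, T_cont={t_cont:.2f}, equals baseline budget: {matches}")
```

Under these assumptions the equal-budget identity holds for every $S$, since $2 - \frac{3S+5}{4S} = \frac{5(S-1)}{4S}$, and dividing the remaining budget by the $2Lc$ cost of a full-model step gives the quoted $T_{cont}$.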

Abstract

Training large language models is a computationally intensive process that often requires substantial resources to achieve state-of-the-art results. Incremental layer-wise training has been proposed as a potential strategy to optimize the training process by progressively introducing layers, with the expectation that this approach would lead to faster convergence and more efficient use of computational resources. In this paper, we investigate the effectiveness of incremental training for LLMs, dividing the training process into multiple stages where layers are added progressively. Our experimental results indicate that while the incremental approach initially demonstrates some computational efficiency, it ultimately requires greater overall computational costs to reach comparable performance to traditional full-scale training. Although the incremental training process can eventually close the performance gap with the baseline, it does so only after significantly extended continual training. These findings suggest that incremental layer-wise training may not be a viable alternative for training large language models, highlighting its limitations and providing valuable insights into the inefficiencies of this approach.

Paper Structure

This paper contains 32 sections, 17 equations, 2 figures.

Figures (2)

  • Figure 1: Training and validation loss curves comparing incremental layer-wise training (with 4, 8, and 12 stages) and baseline training. The large solid circles mark the points where the incremental training regimes have reached the same cumulative computational cost as the baseline model trained for 10,000 steps.
  • Figure 2: HellaSwag accuracy scores comparing incremental layer-wise training (with 4, 8, and 12 stages) and baseline training. The large solid circles indicate the performance of the incremental models at the steps where their cumulative computational cost equals that of the baseline model trained for 10,000 steps.