GrowLength: Accelerating LLMs Pretraining by Progressively Growing Training Length

Hongye Jin; Xiaotian Han; Jingfeng Yang; Zhimeng Jiang; Chia-Yuan Chang; Xia Hu

TL;DR

Training large language models is computationally expensive. GrowLength improves efficiency by progressively growing the training sequence length during pretraining. It builds on RoPE-based positional embeddings and on insights from context window extension: a model pretrained on short sequences can be carried over to longer contexts by direct position extrapolation. Experiments across model sizes show faster convergence and lower loss within the same training time, along with improved long-context capabilities. The method is orthogonal to existing acceleration techniques, so it can reduce training cost while expanding the effective context during pretraining.
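To illustrate why direct position extrapolation is plausible with RoPE, the following minimal sketch rotates query and key vectors by position-dependent angles and checks that the attention score depends only on the relative offset between the two positions. It is a plain-Python illustration of standard RoPE, not the authors' implementation; the function names `rope_rotate` and `score` and the base of 10000 are assumptions made for this example.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Apply a RoPE-style rotation to `vec` (even length) at `position`.

    Each consecutive pair (vec[i], vec[i + 1]) is rotated by the angle
    position * base ** (-i / dim), where i is the even index of the pair.
    """
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def score(q, k, q_pos, k_pos):
    """Dot product of a rotated query and a rotated key."""
    q_rot = rope_rotate(q, q_pos)
    k_rot = rope_rotate(k, k_pos)
    return sum(a * b for a, b in zip(q_rot, k_rot))


if __name__ == "__main__":
    q = [0.3, -1.2, 0.7, 0.5]
    k = [1.1, 0.4, -0.6, 0.9]
    # Same offset (3) at positions inside and far outside a short window.
    print(score(q, k, 5, 2))
    print(score(q, k, 4005, 4002))
```

Because the rotation formula is defined for any position, the same code runs unchanged at positions beyond those seen during short-sequence training; how well a trained model behaves at those positions is an empirical question that the paper's experiments address.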

Abstract

The growing sophistication of Large Language Models (LLMs) has yielded unprecedented advances, but these models also demand considerable computational resources and incur significant costs. To alleviate these challenges, this paper introduces a novel, simple, and effective method named "GrowLength" to accelerate the pretraining of LLMs. Our method progressively increases the training sequence length throughout the pretraining phase, thereby reducing computational cost and improving efficiency. For instance, training begins with a sequence length of 128 and progressively extends to 4096. This approach lets a model process more tokens within a limited time budget, potentially boosting its performance. In other words, the efficiency gain comes from training with shorter sequences, which use resources more effectively. Our extensive experiments with various state-of-the-art LLMs show that models trained with our method not only converge faster but also achieve better performance than those trained with existing methods. Furthermore, our method does not require any additional engineering effort, making it a practical solution for LLM pretraining.
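The progressive schedule described above can be sketched as a function from training step to sequence length. This is a minimal sketch under stated assumptions: the abstract gives only the endpoints (128 and 4096), so the intermediate lengths, the equal split of steps across stages, and the fixed token budget per batch are all hypothetical choices, as are the names `seq_len_at` and `batch_size_at`.

```python
def seq_len_at(step, total_steps, lengths=(128, 256, 512, 1024, 2048, 4096)):
    """Return the training sequence length to use at `step`.

    Assumption: training steps are split evenly across the stages in
    `lengths`; the source does not specify how steps are allocated.
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must be in [0, total_steps)")
    stage = step * len(lengths) // total_steps
    return lengths[stage]


def batch_size_at(seq_len, tokens_per_batch=131072):
    """Number of sequences per batch for a fixed token budget.

    Assumption: the token count per batch is held constant, so shorter
    sequences are packed into larger batches.
    """
    return max(1, tokens_per_batch // seq_len)


if __name__ == "__main__":
    total = 6000
    for step in (0, 1000, 3000, 5999):
        n = seq_len_at(step, total)
        print(f"step={step} seq_len={n} batch_size={batch_size_at(n)}")
```

A training loop would call `seq_len_at` each step (or each stage boundary) and re-chunk the token stream to the returned length before building the batch.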

Paper Structure (19 sections, 3 equations, 6 figures, 3 tables, 1 algorithm)

Introduction
Preliminaries and Motivation
Positional Embedding
Context Windows Extension in Fine-tuning Phase
Motivation
Method
Implementation
What advantages can be gained by training LLMs with shorter sequences?
Discussion
Experiments
How fast can the proposed method accelerate the LLMs pretraining?
Will the proposed method result in the same or lower loss?
How does our proposed method perform on different sizes of the LLMs?
Will our methods show better context windows extension abilities?
The influence from ratios of different window size during training
...and 4 more sections

Figures (6)

Figure 1: Training loss curves for our proposed method and the baselines, given the same training time. The curves show Large Language Models (LLMs) trained with fixed sequence lengths of 128 (LLM128) and 1024 (LLM1024), and with our method. Compared with LLM1024, GrowLength attains a lower loss. This can be attributed to our method processing more tokens within the same training time, allowing the model to have a broader context. Compared with LLM128, GrowLength also attains a lower loss, because the model trained with our method has experienced longer sequences, which enables better learning. Against both the short and the long fixed-length baselines, our method performs better within the same pretraining time, establishing its efficacy over the baseline models.
Figure 2: Comparison of the LLMs trained with the same total of tokens.
Figure 3: Comparison of different model sizes with and without GrowLength. Three model pairs (70M, 160M, 410M) are trained for the same amount of time.
Figure 4: Comparison of the context window extension abilities
Figure 5: Comparison of different ratios of training lengths. There are three schedules: (128, 256, 512, 1024); (128, 512, 1024); and (128, 1024).
...and 1 more figures
