
GrowLength: Accelerating LLMs Pretraining by Progressively Growing Training Length

Hongye Jin, Xiaotian Han, Jingfeng Yang, Zhimeng Jiang, Chia-Yuan Chang, Xia Hu

TL;DR

Training large language models is computationally expensive. GrowLength progressively grows the training sequence length during pretraining to improve efficiency. It relies on RoPE-based positional embeddings and on insights from context window extension: a model pretrained on short sequences can be scaled to longer contexts by direct position extrapolation. Empirical results across model sizes show faster convergence and lower loss under the same training time, along with improved long-context capabilities. The method is orthogonal to existing acceleration techniques and can reduce training costs while expanding the effective context during pretraining.

Abstract

The evolving sophistication and intricacies of Large Language Models (LLMs) yield unprecedented advancements, yet they simultaneously demand considerable computational resources and incur significant costs. To alleviate these challenges, this paper introduces a novel, simple, and effective method named "GrowLength" to accelerate the pretraining process of LLMs. Our method progressively increases the training length throughout the pretraining phase, thereby mitigating computational costs and enhancing efficiency. For instance, it begins with a sequence length of 128 and progressively extends to 4096. This approach enables models to process a larger number of tokens within limited time frames, potentially boosting their performance. In other words, the efficiency gain comes from training with shorter sequences, which makes better use of resources. Our extensive experiments with various state-of-the-art LLMs have revealed that models trained using our method not only converge more swiftly but also exhibit superior performance metrics compared to those trained with existing methods. Furthermore, our method for accelerating LLM pretraining does not require any additional engineering effort, making it a practical solution in the realm of LLMs.
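The progressive growth described in the abstract can be illustrated with a minimal sketch. The abstract only states that training starts at a sequence length of 128 and extends to 4096; the intermediate lengths, the equal-sized stages, the fixed per-step token budget, and the function names `grow_length_schedule` and `batch_size_for` are all assumptions made for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a progressively growing sequence-length schedule.
# Assumptions (not from the paper): lengths double at each stage, every
# stage gets an equal share of the optimizer steps, and the number of
# tokens per step is held constant.

ASSUMED_LENGTHS = (128, 256, 512, 1024, 2048, 4096)


def grow_length_schedule(step, total_steps, lengths=ASSUMED_LENGTHS):
    """Return the training sequence length to use at optimizer step `step`."""
    if total_steps <= 0 or not 0 <= step < total_steps:
        raise ValueError("step must satisfy 0 <= step < total_steps")
    # Map the step to a stage index in [0, len(lengths) - 1].
    stage = step * len(lengths) // total_steps
    return lengths[stage]


def batch_size_for(seq_len, tokens_per_step):
    """Number of sequences per step under a fixed token budget (assumed)."""
    if tokens_per_step % seq_len != 0:
        raise ValueError("tokens_per_step must be a multiple of seq_len")
    return tokens_per_step // seq_len


if __name__ == "__main__":
    total_steps = 6000
    for step in (0, 1000, 3000, 5999):
        seq_len = grow_length_schedule(step, total_steps)
        print(step, seq_len, batch_size_for(seq_len, tokens_per_step=65536))
```

Under these assumptions, early stages pack many short sequences into each step, and later stages pack fewer, longer ones. Because self-attention cost grows faster than linearly with sequence length, the short-sequence stages are cheaper per token, which is the source of the efficiency gain the abstract describes.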

Paper Structure (19 sections, 3 equations, 6 figures, 3 tables, 1 algorithm)

Figures (6)

  • Figure 1: Training loss curves of our proposed method and the baselines, given the same training time. The figure shows the training loss for Large Language Models (LLMs) trained with fixed sequence lengths of 128 (LLM128) and 1024 (LLM1024), and with our method. Compared with LLM1024, GrowLength attains a lower loss. This can be attributed to the fact that our method processes more tokens within the same training time, allowing the model to have a broader context. Similarly, the comparison between LLM128 and GrowLength reveals that our method also secures a lower loss. This is because the model trained by our method has seen longer sequences, enabling better learning ability. Compared with both the short and the long fixed-length settings, our proposed method demonstrates better performance within the same pretraining time, establishing its efficacy over the baseline models.
  • Figure 2: Comparison of LLMs trained with the same total number of tokens.
  • Figure 3: Comparison of models of different sizes with and without GrowLength. Three model pairs (70M, 160M, 410M) are trained for the same amount of time.
  • Figure 4: Comparison of the context window extension abilities
  • Figure 5: Comparison of the different ratios of the training length. There are three different ratios: (128, 256, 512, 1024); (128, 512, 1024); (128, 1024).
  • ...and 1 more figures