STEP: Staged Parameter-Efficient Pre-training for Large Language Models

Kazuki Yano, Takumi Ito, Jun Suzuki

TL;DR

STEP tackles the memory bottleneck in pre-training large language models by combining staged model growth with parameter-efficient tuning (PET). It proceeds in stages: vanilla pre-training of a small model, growth via the Growth Operator, PET adapters on the already-trained layers, and retraining restricted to the newly added parameters, with an ILP-based memory optimization to bound peak memory. Empirically, STEP achieves up to a 53.9% reduction in maximum memory requirements while preserving perplexity and downstream performance; after instruction tuning, models pre-trained with STEP perform on par with vanilla pre-trained models. This lets memory-constrained researchers pre-train LLMs with competitive results.

Abstract

Pre-training large language models (LLMs) faces significant memory challenges due to the large number of model parameters. We introduce STaged parameter-Efficient Pre-training (STEP), which integrates parameter-efficient tuning techniques with model growth. We conduct experiments on pre-training LLMs of various sizes and demonstrate that STEP achieves up to a 53.9% reduction in maximum memory requirements compared to vanilla pre-training while maintaining equivalent performance. Furthermore, we show that models pre-trained with STEP perform comparably to vanilla pre-trained models on downstream tasks after instruction tuning.

Paper Structure

This paper contains 39 sections, 5 equations, 3 figures, 10 tables.

Figures (3)

  • Figure 1: Overview of STEP (STaged parameter-Efficient Pre-training). First, vanilla pre-training is performed on a small-scale model (Procedure 1). Subsequently, new layers are added to grow the pre-trained model (Procedure 2). The parameters of the pre-trained layers are then frozen, and Parameter-Efficient Training (PET) is applied for alternative training (Procedure 3), followed by retraining of the expanded model (Procedure 4). In Procedure 4, only the parameters added through layer expansion and the small-scale parameters introduced by PET are subject to training.
  • Figure 2: Memory consumption of pre-training the 1.2B model in Table \ref{tab:model_config}. STEP allows for increasing the model size while keeping memory usage consistent at every stage.
  • Figure 3: Illustration of different strategies for adding new layers in STEP. 'Upper' adds layers at the top, 'Intermediate' inserts layers in the middle, and 'Lower' adds layers at the bottom.
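
The procedure in Figure 1 and the placement strategies in Figure 3 can be illustrated with a small bookkeeping sketch. This is not the authors' implementation: the `Layer` class, the `grow` function, and all parameter counts are hypothetical and chosen only for illustration. It models which parameters are trainable after one growth stage; it does not model how new layers are initialized by the Growth Operator, nor the ILP-based memory optimization.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Layer:
    """Toy stand-in for one transformer layer (hypothetical, for illustration)."""
    name: str
    n_params: int           # full-size parameters of the layer
    trainable: bool = True  # False once the layer has been frozen
    pet_params: int = 0     # small PET parameters attached to a frozen layer


def grow(layers: List[Layer], n_new: int, params_per_layer: int,
         pet_params_per_layer: int, position: str = "upper") -> List[Layer]:
    """One growth stage: freeze existing layers, attach PET, add new layers."""
    # Procedure 3: freeze the pre-trained layers and attach PET parameters.
    frozen = [replace(l, trainable=False, pet_params=pet_params_per_layer)
              for l in layers]
    # Procedure 2: create the new, fully trainable layers.
    start = len(layers)
    new = [Layer(f"new{start + i}", params_per_layer) for i in range(n_new)]
    # Placement strategies named in Figure 3.
    if position == "upper":
        return frozen + new
    if position == "lower":
        return new + frozen
    if position == "intermediate":
        mid = len(frozen) // 2
        return frozen[:mid] + new + frozen[mid:]
    raise ValueError(f"unknown position: {position}")


def trainable_params(layers: List[Layer]) -> int:
    """Parameters updated in Procedure 4: new layers plus PET parameters."""
    return sum((l.n_params if l.trainable else 0) + l.pet_params for l in layers)


def total_params(layers: List[Layer]) -> int:
    """All parameters held by the model, frozen or not."""
    return sum(l.n_params + l.pet_params for l in layers)


# Procedure 1: a small model whose layers are all trained (toy sizes).
small = [Layer(f"base{i}", 100) for i in range(4)]
# Procedures 2-3: grow by two layers at the top; Procedure 4 would then
# update only trainable_params(grown) parameters.
grown = grow(small, n_new=2, params_per_layer=100, pet_params_per_layer=5)
```

In this toy setting the grown model holds more parameters in total than the small one, yet fewer of them are trainable, which is the intuition behind keeping memory usage roughly level across stages: gradients and optimizer states are only needed for the trainable subset.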