STEP: Staged Parameter-Efficient Pre-training for Large Language Models
Kazuki Yano, Takumi Ito, Jun Suzuki
TL;DR
STEP tackles the memory bottleneck in pre-training large language models by combining staged model growth with parameter-efficient tuning (PET). It interleaves vanilla pre-training of a small model, growth via the Growth Operator, PET adapters, and selective retraining, and uses an integer linear programming (ILP) based memory optimization to bound peak memory. Empirically, STEP reduces maximum memory requirements by up to 53.9% while preserving perplexity and downstream performance; after instruction tuning, STEP-trained models perform on par with vanilla pre-trained models. This lets memory-constrained researchers pre-train LLMs with competitive results.
Abstract
Pre-training large language models (LLMs) is memory-intensive because of the large number of model parameters. We introduce STaged parameter-Efficient Pre-training (STEP), which integrates parameter-efficient tuning techniques with model growth. We conduct experiments on pre-training LLMs of various sizes and demonstrate that STEP reduces maximum memory requirements by up to 53.9% compared to vanilla pre-training while maintaining equivalent performance. Furthermore, we show that models pre-trained with STEP perform comparably to vanilla pre-trained models on downstream tasks after instruction tuning.
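To make the idea of bounding peak memory across stages concrete, here is a minimal accounting sketch. It is not the paper's formulation. The per-parameter byte costs (16 bytes for a trainable parameter under mixed-precision Adam, 2 bytes for a frozen one), the assumption that earlier-stage parameters are frozen once adapters are attached, and the parameter counts are all hypothetical values chosen for illustration. The sketch only evaluates a hand-picked schedule; it does not solve the ILP that the TL;DR mentions.

```python
# Illustrative memory accounting for staged pre-training.
# All constants and parameter counts below are ASSUMPTIONS, not figures from the paper.

BYTES_TRAINABLE = 16  # assumed: fp16 weight + fp16 grad + fp32 master copy + two fp32 Adam moments
BYTES_FROZEN = 2      # assumed: a frozen parameter keeps only its fp16 weight


def stage_memory(frozen_params, trainable_params):
    """Memory for model states in one stage, in the same unit as the counts times bytes."""
    return BYTES_FROZEN * frozen_params + BYTES_TRAINABLE * trainable_params


def peak_memory(stages):
    """Peak over all stages; each stage is a (frozen_params, trainable_params) pair."""
    return max(stage_memory(frozen, trainable) for frozen, trainable in stages)


def reduction_percent(vanilla_peak, staged_peak):
    """Relative reduction of the staged peak with respect to the vanilla peak."""
    return 100.0 * (1.0 - staged_peak / vanilla_peak)


# Hypothetical target model: 1000M parameters (counts are in millions).
vanilla = [(0, 1000)]  # every parameter is trainable throughout

# Hypothetical two-stage schedule:
#   stage 1: train a 400M-parameter model in full
#   stage 2: freeze those 400M, add 20M adapter parameters and 600M new parameters
staged = [(0, 400), (400, 20 + 600)]

vanilla_peak = peak_memory(vanilla)
staged_peak = peak_memory(staged)
saving = reduction_percent(vanilla_peak, staged_peak)
print(f"vanilla peak: {vanilla_peak} MB, staged peak: {staged_peak} MB, saving: {saving:.1f}%")
```

Under these assumed costs the second stage dominates the staged schedule, so the size chosen for the first stage determines the peak. Choosing stage sizes to minimize that peak is the role the ILP-based optimization plays in STEP.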
