Scalable Parameter and Memory Efficient Pretraining for LLM: Recent Algorithmic Advances and Benchmarking

Athanasios Glentis, Jiaxiang Li, Qiulin Shang, Andi Han, Ioannis Tsaknakis, Quan Wei, Mingyi Hong

TL;DR

This work analyzes scalability challenges in pretraining large language models and evaluates parameter- and memory-efficient approaches. Through a comprehensive survey and a benchmark across model sizes, it shows that full-rank pretraining with a properly chosen optimizer and hyperparameters still yields the best perplexity, and that incorporating high-rank updates is the key to making low-rank approaches competitive. The authors introduce two practical techniques, weight refactorization and momentum reset, that improve the performance of low-rank methods: on a 1B model, the resulting low-rank method reaches lower perplexity than GaLore and Fira while using about 25% less memory. Their findings indicate that, with careful optimization and these techniques, parameter- and memory-efficient pretraining can approach full-rank training while reducing resource demands. The work provides actionable guidance for practitioners and lays groundwork for extended evaluations across more models and datasets.

Abstract

Fueled by their remarkable ability to tackle diverse tasks across multiple domains, large language models (LLMs) have grown at an unprecedented rate, with some recent models containing trillions of parameters. This growth is accompanied by substantial computational challenges, particularly regarding the memory and compute resources required for training and fine-tuning. Numerous approaches have been explored to address these issues, such as LoRA. While these methods are effective for fine-tuning, their application to pre-training is significantly more challenging due to the need to learn from vast datasets. Motivated by this issue, we aim to address the following questions: Can parameter- or memory-efficient methods enhance pre-training efficiency while achieving performance comparable to full-model training? How can the performance gap be narrowed? To this end, the contributions of this work are the following. (1) We begin by conducting a comprehensive survey that summarizes state-of-the-art methods for efficient pre-training. (2) We perform a benchmark evaluation of several representative memory-efficient pre-training approaches to comprehensively evaluate their performance across model sizes. We observe that with a proper choice of optimizer and hyperparameters, full-rank training delivers the best performance, as expected. We also notice that incorporating high-rank updates in low-rank approaches is the key to improving their performance. (3) Finally, we propose two practical techniques, namely weight refactorization and momentum reset, to enhance the performance of efficient pre-training methods. We observe that applying these techniques to the low-rank method (on a 1B model) can achieve a lower perplexity than popular memory-efficient algorithms such as GaLore and Fira, while simultaneously using about 25% less memory.

Paper Structure

This paper contains 30 sections, 3 theorems, 43 equations, 6 figures, 3 tables, 1 algorithm.

Key Result

Lemma 4.1

Let $B^* A^*$ be a local minimizer of $\ell$, i.e., $\nabla \ell (B^*A^*) = 0$ and $\nabla^2 \ell (B^* A^*) \succ 0$. Let $\underline{\lambda}, \overline{\lambda}$ be the smallest and largest eigenvalues of $\nabla^2 \ell (B^* A^*)$, so that $\overline{\lambda} \geq \underline{\lambda} > 0$. Let $B^* A^* = U \Sigma V^\top$ be the SVD and suppose $B^* = U \Sigma^{\alpha}$ and $A^* = \Sigma^
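
The lemma statement is cut off in this excerpt, so the exact form of $A^*$ and the conclusion are not available here. As an illustration only, the sketch below assumes the complementary split $A = \Sigma^{1-\alpha} V^\top$, which is the choice that keeps the product $BA$ unchanged when $B = U \Sigma^{\alpha}$. The function name `refactor`, the matrix shapes, and the value of `alpha` are hypothetical, and this is not claimed to be the authors' weight refactorization procedure.

```python
import numpy as np


def refactor(B, A, alpha=0.5):
    """Re-split the product B @ A as (U S^alpha, S^(1-alpha) V^T) via a thin SVD.

    Assumption: A takes the complementary exponent 1 - alpha, so that
    B_new @ A_new equals B @ A up to floating-point error.
    """
    r = B.shape[1]
    U, s, Vt = np.linalg.svd(B @ A, full_matrices=False)
    # The product has rank at most r, so keep only the leading r components.
    U, s, Vt = U[:, :r], s[:r], Vt[:r, :]
    B_new = U * s**alpha                        # scale column j of U by s[j]**alpha
    A_new = (s ** (1.0 - alpha))[:, None] * Vt  # scale row j of Vt by s[j]**(1-alpha)
    return B_new, A_new


rng = np.random.default_rng(0)
B = rng.standard_normal((64, 8))
A = rng.standard_normal((8, 32))
B_new, A_new = refactor(B, A, alpha=0.5)
```

With `alpha=0.5` the two factors are balanced in the sense that `B_new.T @ B_new` and `A_new @ A_new.T` are both equal to the diagonal matrix of singular values, while the product `B_new @ A_new` matches the original `B @ A`.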

Figures (6)

  • Figure 1: Estimated memory consumption of pre-training and inference for a 7B model with a token batch size of 256 on a single device. All methods use BF16 format and AdamW (without optimizer and activation checkpointing). For low-rank, LoRA, and GaLore we assume a rank of $512$ (with the original attention head dimension as $2048$), and for SLTrain we use rank $512$ and $\delta=0.1$. The weight factorization methods (low-rank and SLTrain) save more memory compared to memory-efficient optimizers (GaLore and Fira), with a possible compromise on performance (see Section \ref{sec:bench_res} for the results).
  • Figure 2: Loss curve and evaluation perplexity before and after employing the momentum reset and adaptive gradient clipping, as suggested by huang2025stablespam.
  • Figure 3: Performance of 60M, 130M and 350M Llama models after hyperparameter search (Wandb sweep).
  • Figure 4: Perplexity, memory, and parameter size for pretraining LLaMA 1B on the C4 dataset with different methods. The radius and color of each circle scale with parameter size. Overall, methods with smaller, lighter circles in the bottom-left corner are desirable for pretraining.
  • Figure 5: Scaling performance (in terms of evaluation perplexity) of different model structures with respect to compute (FLOPs). For the x-axis, the FLOPs are multiples of $10^{18}$. Here we include all the training trajectories (with different hyperparameter settings) to construct this scaling law.
  • ...and 1 more figure
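
The qualitative claim in the Figure 1 caption, that weight factorization saves more memory than memory-efficient optimizers, can be checked with a rough per-matrix count. The sketch below is an assumption-laden illustration rather than the authors' estimator: it considers a single hypothetical $2048 \times 2048$ weight with rank $512$, counts only weights and AdamW moment buffers in BF16 (gradients and activations are ignored), and uses a commonly quoted accounting for LoRA and GaLore. SLTrain and Fira are omitted, and the function name `matrix_memory` is hypothetical.

```python
BYTES_BF16 = 2  # bytes per BF16 value


def matrix_memory(m, n, r):
    """Bytes for weights plus AdamW moments of one m x n weight (assumes m <= n).

    Assumed accounting (gradients and activations excluded):
      full-rank: m*n weights, 2*m*n moments
      low-rank:  r*(m+n) weights (W = B A), 2*r*(m+n) moments
      LoRA:      m*n frozen weights + r*(m+n) adapters, 2*r*(m+n) moments
      GaLore:    m*n weights, m*r projection, 2*n*r projected moments
    """
    counts = {
        "full-rank": m * n + 2 * m * n,
        "low-rank": r * (m + n) + 2 * r * (m + n),
        "LoRA": m * n + r * (m + n) + 2 * r * (m + n),
        "GaLore": m * n + m * r + 2 * n * r,
    }
    return {name: count * BYTES_BF16 for name, count in counts.items()}


mem = matrix_memory(m=2048, n=2048, r=512)
for name, num_bytes in mem.items():
    print(f"{name:>9}: {num_bytes / 2**20:.1f} MiB")
```

Under these assumptions the low-rank factorization needs the least memory for this matrix, followed by GaLore, LoRA, and full-rank training, which is consistent with the ordering described in the caption. Real totals for a 7B model also depend on embeddings, activations, and which layers are factorized.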

Theorems & Definitions (3)

  • Lemma 4.1
  • Theorem 4.1: Theorem 7.4 in garrigos2023handbook
  • Theorem 4.2