SLoPe: Double-Pruned Sparse Plus Lazy Low-Rank Adapter Pretraining of LLMs

Mohammad Mozaffari, Amir Yazdanbakhsh, Zhao Zhang, Maryam Mehri Dehnavi

TL;DR

SLoPe introduces a double-pruned sparse plus lazy low-rank adapter pretraining scheme for LLMs. It combines a static $N:M$ sparsity pattern with a backward pass that is pruned twice, which accelerates both forward and backward computations while preserving accuracy. Low-rank adapters are added only in the final 1% of pretraining, with the dense weight decomposed as $W_{dense} = W_{sparse} + LR$, to boost capacity with minimal overhead. Key contributions include a convergence-guaranteed double-pruned backward pass, lazy low-rank adapters, and optimized CUDA kernels that enable end-to-end speedups of up to $1.25\times$ for training and $1.54\times$ for inference, with memory reductions of up to $0.63\times$ and $0.61\times$, respectively. Empirical results on GPT2 and BERT-like setups show improved pretraining perplexities and downstream task performance compared to prior sparse pretraining methods, supporting the practicality of sparse plus low-rank pretraining for very large models.

Abstract

We propose SLoPe, a Double-Pruned Sparse Plus Lazy Low-rank Adapter Pretraining method for LLMs that improves the accuracy of sparse LLMs while accelerating their pretraining and inference and reducing their memory footprint. Sparse pretraining of LLMs reduces the accuracy of the model; to overcome this, prior work uses dense models during fine-tuning. SLoPe improves the accuracy of sparsely pretrained models by adding low-rank adapters in the final 1% of pretraining iterations, without adding significant overhead to model pretraining and inference. In addition, SLoPe uses a double-pruned backward pass formulation that prunes the transposed weight matrix using N:M sparsity structures to enable an accelerated sparse backward pass. SLoPe accelerates the training and inference of models with billions of parameters by up to $1.25\times$ and $1.54\times$, respectively (OPT-33B and OPT-66B), while reducing their memory usage by up to $0.63\times$ and $0.61\times$ for training and inference, respectively.
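
As a rough illustration of the two ingredients named in the abstract, the following NumPy sketch applies magnitude-based N:M pruning within each row of a weight matrix and adds a low-rank term $LR$ to the forward pass only in the final 1% of steps. The function names (`nm_prune_rows`, `forward`, `adapters_enabled`), the magnitude criterion, and the tensor layout ($W$ of shape $d_{out} \times d_{in}$) are assumptions made for this example, not the authors' implementation, which relies on custom CUDA kernels.

```python
import numpy as np

def nm_prune_rows(w, n=2, m=4):
    """Keep the n largest-magnitude entries in every group of m consecutive
    elements of each row and zero the rest (assumed pruning criterion)."""
    rows, cols = w.shape
    assert cols % m == 0, "row length must be divisible by m"
    groups = w.reshape(rows, cols // m, m)
    # Positions of the (m - n) smallest-magnitude entries in each group.
    drop = np.argsort(np.abs(groups), axis=-1)[..., : m - n]
    mask = np.ones(groups.shape, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=-1)
    return (groups * mask).reshape(rows, cols)

def adapters_enabled(step, total_steps, lazy_fraction=0.01):
    """True only during the last `lazy_fraction` of training steps."""
    return step >= total_steps - round(total_steps * lazy_fraction)

def forward(x, w_sparse, L=None, R=None):
    """y = x W_sparse^T, plus x (L R)^T when adapters are supplied."""
    y = x @ w_sparse.T
    if L is not None and R is not None:
        # Two thin matmuls; the dense product L @ R is never materialised.
        y = y + (x @ R.T) @ L.T
    return y
```

With adapters supplied, `forward` computes the same output as multiplying by the dense matrix $W_{sparse} + LR$, at the cost of two extra matrix products of rank $r$.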

Paper Structure

This paper contains 33 sections, 2 theorems, 20 equations, 10 figures, 17 tables, 1 algorithm.

Key Result

Lemma 2.1

Consider a randomly initialized matrix $A$. Following our notation, we denote the row-wise pruned version of $A$ by $A^R$ and the joint column- and row-wise pruned version of $A$ by $A^{R, C}$. We use $D(\cdot)$ to denote the density ratio of a matrix. eq:sparsity_ratio shows the additional zero eleme
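
The quantities in the lemma can be probed numerically. The sketch below assumes magnitude-based N:M pruning and assumes that $A^{R,C}$ is obtained by pruning $A^R$ a second time along its columns; `nm_prune` and `density` are hypothetical helper names, and the matrix size and seed are arbitrary. For 2:4 sparsity on an i.i.d. Gaussian matrix, $D(A^R)$ is exactly $0.5$, and a simple binomial count over each column group of four suggests $D(A^{R,C}) \approx 13/32 \approx 0.406$.

```python
import numpy as np

def nm_prune(a, n=2, m=4, axis=1):
    """N:M magnitude pruning along `axis` (1: within rows, 0: within columns)."""
    if axis == 0:
        # Column-wise pruning is row-wise pruning of the transpose.
        return nm_prune(a.T, n, m, axis=1).T
    rows, cols = a.shape
    assert cols % m == 0, "pruned dimension must be divisible by m"
    groups = a.reshape(rows, cols // m, m)
    # Positions of the (m - n) smallest-magnitude entries in each group.
    drop = np.argsort(np.abs(groups), axis=-1)[..., : m - n]
    mask = np.ones(groups.shape, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=-1)
    return (groups * mask).reshape(rows, cols)

def density(a):
    """Fraction of non-zero entries, i.e. D(a)."""
    return np.count_nonzero(a) / a.size

rng = np.random.default_rng(0)
A = rng.standard_normal((512, 512))
A_R = nm_prune(A, axis=1)     # row-wise 2:4 pruning
A_RC = nm_prune(A_R, axis=0)  # second, column-wise pruning of A^R
print(f"D(A^R)   = {density(A_R):.4f}")
print(f"D(A^R,C) = {density(A_RC):.4f}")
```

The gap between the two printed densities is the fraction of extra zeros imposed by the second pruning step in this toy setting.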

Figures (10)

  • Figure 1: The sparse training pipeline in SLoPe. Here, $\mathcal{X}$, $\mathcal{Y}$, and $\mathcal{W}$ denote the input, output, and weight tensors for a specific layer, respectively. $\nabla_{\cdot}\mathcal{L}$ represents the gradient of the loss function. $\mathcal{L}$ and $\mathcal{R}$ are the low-rank terms that are introduced only in the final 1% of iterations. The superscript $R$ denotes row-wise pruning using the $N$:$M$ scheme, and $R,C$ denotes both column- and row-wise $N$:$M$ sparsification, which leads to extra imposed zeros. Blue elements represent non-zero values, white elements represent pruned values, and red elements indicate additional zeros introduced during the backward pass.
  • Figure 2: Validation perplexity of GPT2-Small and GPT2-Large on OpenWebText. $\gamma_w$ shows the value of the decay factor parameter in Extended SR-STE (FST).
  • Figure 3: (a) The speedup achieved using the cuSPARSELt backend in PyTorch for Attention ($d_{out} = d_{in}$), Upsample ($d_{out} = 4d_{in}$) and Downsample ($d_{out} = \frac{d_{in}}{4}$) matrices with a batch size of 2048. (b) The cosine similarity between the low-rank adapters and the converged adapters for different layers in the model. The cosine similarities are averaged over the 24 layers of BERT-Large-Uncased.
  • Figure 4: Average mask difference between each iteration and the converged sparsity pattern in GPT2-Small pretraining using SR-STE. The highlighted area shows the ratio of the resources used for updating weights that are pruned and not used in the inference of the model.
  • Figure 5: The setup and multiplication time for square matrices using the cuSPARSELt SpMM backend.
  • ...and 5 more figures

Theorems & Definitions (2)

  • Lemma 2.1
  • Theorem 2.2