SliceGPT: Compress Large Language Models by Deleting Rows and Columns

Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, James Hensman

TL;DR

SliceGPT is a post-training, structured sparsification method that replaces large weight matrices with smaller dense ones by applying per-block orthogonal transformations and PCA-based slicing, thereby reducing the embedding dimension. It builds on a computational invariance property of RMSNorm-connected transformers: applying an orthogonal transformation to the weights leaves the network's outputs unchanged, so rows and columns of the transformed matrices can then be deleted with limited degradation in generation and zero-shot tasks. The method achieves 25–30% parameter reduction across several large LLMs. Sliced models show meaningful inference-throughput and GPU-count savings on consumer to datacenter GPUs without additional code optimization, and compare favorably to SparseGPT while offering practical deployment benefits. The work also discusses calibration-dependent variability, numerical stability considerations, and avenues for future improvement, including combining with quantization and alternative per-block transformations.

Abstract

Large language models have become the cornerstone of natural language processing, but their use comes with substantial costs in terms of compute and memory resources. Sparsification provides a solution to alleviate these resource constraints, and recent works have shown that trained models can be sparsified post-hoc. Existing sparsification techniques face challenges as they need additional data structures and offer constrained speedup with current hardware. In this paper we present SliceGPT, a new post-training sparsification scheme which replaces each weight matrix with a smaller (dense) matrix, reducing the embedding dimension of the network. Through extensive experimentation, we show that SliceGPT can remove up to 25% of the model parameters (including embeddings) for LLAMA2-70B, OPT 66B and Phi-2 models while maintaining 99%, 99% and 90% zero-shot task performance of the dense model respectively. Our sliced models run on fewer GPUs and run faster without any additional code optimization: on 24GB consumer GPUs we reduce the total compute for inference on LLAMA2-70B to 64% of that of the dense model; on 40GB A100 GPUs we reduce it to 66%. We offer a new insight, computational invariance in transformer networks, which enables SliceGPT and we hope it will inspire and enable future avenues to reduce memory and computation demands for pre-trained models. Code is available at: https://github.com/microsoft/TransformerCompression
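
To make "replaces each weight matrix with a smaller (dense) matrix" concrete, here is a minimal NumPy sketch of PCA-based slicing on a single toy linear layer. It is an illustration under stated assumptions, not the authors' implementation: the synthetic calibration signals `X`, the sizes `N`, `D`, `H`, `k`, and the choice to take the rotation from the eigenvectors of `X.T @ X` are all hypothetical choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(2)
N, D, H, k = 256, 16, 32, 12  # tokens, embedding dim, hidden dim, kept dim (25% sliced)

# Hypothetical calibration signals: variance decays along randomly oriented directions.
basis, _ = np.linalg.qr(rng.standard_normal((D, D)))
scales = np.linspace(3.0, 0.1, D)
X = (rng.standard_normal((N, D)) * scales) @ basis.T
W_in = rng.standard_normal((D, H))

# PCA: eigenvectors of X^T X, reordered by decreasing eigenvalue.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
Q = eigvecs[:, ::-1]

# Rotate, then delete the last D-k columns of the signal and rows of the weight.
X_sliced = (X @ Q)[:, :k]       # shape (N, k)
W_sliced = (Q.T @ W_in)[:k, :]  # shape (k, H): smaller, but still dense

# Reconstruction error of the signal with and without the rotation.
err_pca = np.linalg.norm(X - X_sliced @ Q[:, :k].T)
err_naive = np.linalg.norm(X[:, k:])  # error from deleting raw coordinates instead

# Relative error of the layer output after slicing.
rel_out = np.linalg.norm(X @ W_in - X_sliced @ W_sliced) / np.linalg.norm(X @ W_in)
print(f"signal error: PCA {err_pca:.3f} vs naive {err_naive:.3f}; output rel. error {rel_out:.3f}")
```

The sliced layer multiplies an `(N, k)` signal by a `(k, H)` dense matrix, so no sparse data structures are needed. Rotating before deleting keeps the directions that carry the most signal energy, which is why the PCA error is never larger than the error from deleting raw coordinates.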

Paper Structure

This paper contains 35 sections, 1 theorem, 7 equations, 8 figures, 13 tables, 1 algorithm.

Key Result

Theorem 1

Let ${\mathbf{W}}_{\textrm{in}}^\ell$ and ${\mathbf{W}}_{\textrm{out}}^\ell$ be the weight matrices of the linear layers of the $\ell$-th block of an RMSNorm-connected transformer network, and ${\bm{b}}_{\textrm{in}}^\ell, {\bm{b}}_\textrm{out}^\ell$ be the corresponding biases, if any, and let [...] The input and head biases are copied: $\tilde{{\bm{b}}}_\textrm{in}^\ell = {\bm{b}}_\textrm{in}^\ell$ [...]
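
The theorem statement is truncated in this copy, so the following is only a numerical sketch of the computational-invariance property it concerns, using the construction described in the Figure 4 caption (input weights pre-multiplied by $\mathbf{Q}^\top$, output weights post-multiplied by $\mathbf{Q}$). The toy block (one linear-ReLU-linear branch with a skip connection), the head matrix `W_head`, the rotation of the output bias, and all sizes are assumptions made for illustration. RMSNorm is taken here as division by the row norm.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, V, N = 8, 16, 5, 4  # embedding dim, hidden dim, vocab size, tokens (toy sizes)

def rmsnorm(x):
    # Assumed convention: divide each row by its Euclidean norm.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def block(x, w_in, b_in, w_out, b_out):
    # Toy RMSNorm-connected block: skip connection around linear -> ReLU -> linear.
    h = np.maximum(rmsnorm(x) @ w_in + b_in, 0.0)
    return x + h @ w_out + b_out

X = rng.standard_normal((N, D))
W_in, b_in = rng.standard_normal((D, H)), rng.standard_normal(H)
W_out, b_out = rng.standard_normal((H, D)), rng.standard_normal(D)
W_head = rng.standard_normal((D, V))

# Random orthogonal matrix from a QR decomposition.
Q, _ = np.linalg.qr(rng.standard_normal((D, D)))

logits = block(X, W_in, b_in, W_out, b_out) @ W_head

# Rotated network: the signal enters as X Q, input weights become Q^T W_in,
# output weights become W_out Q, the input bias is copied unchanged, and the
# output bias is rotated with the signal (an assumption of this sketch).
logits_rot = block(X @ Q, Q.T @ W_in, b_in, W_out @ Q, b_out @ Q) @ (Q.T @ W_head)

print(np.max(np.abs(logits - logits_rot)))  # differs only by floating-point error
```

The check works because an orthogonal matrix preserves row norms, so `rmsnorm(X @ Q)` equals `rmsnorm(X) @ Q`, and each `Q` meets a `Q.T` before any nonlinearity is applied.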

Figures (8)

  • Figure 1: Matrix multiplication of the signal ${\mathbf{X}}$ and a weight matrix ${\mathbf{W}}$ under different types of sparsity. Left: unstructured sparsity, where some elements of ${\mathbf{W}}$ are zero, and ${\mathbf{X}}$ is dense. Middle: 2:4 structured sparsity, where each block of four weight matrix entries contains two zeros, and ${\mathbf{X}}$ is dense. Right: SliceGPT, where after introducing transformation ${\mathbf{Q}}$, all the sparsity is arranged to the bottom rows of ${\mathbf{W}}$ and the corresponding columns of ${\mathbf{X}}$ are removed.
  • Figure 2: A single layer in a transformer network. The signals (inputs) arising from the previous blocks of the network arrive at the bottom of the figure, before being passed through attention, LayerNorm, and FFN. The attention and FFN blocks both have input and output linear operations (blue) which we denote in the text as ${\mathbf{W}}_\textrm{in}, {\mathbf{W}}_\textrm{out}$. The linear operations of LayerNorm ${\mathbf{M}}$ and $\textrm{diag}(\boldsymbol{\alpha})$ are highlighted. This and subsequent figures do not show biases.
  • Figure 3: Converting a transformer network from LayerNorm to RMSNorm: the scale matrix $\textrm{diag}(\boldsymbol{\alpha})$ is absorbed into the subsequent matrix ${\mathbf{W}}_\textrm{in}$. The figure shows the block in combined colors. We use $(\boldsymbol{\alpha})$ for brevity. The mean-subtraction matrix ${\mathbf{M}}$ is applied to each matrix ${\mathbf{W}}_\textrm{out}$. LayerNorm becomes RMSNorm, up to a constant $\sqrt{D}$ (not shown). Here, the scaling $(\boldsymbol{\alpha'})$ comes from the previous block.
  • Figure 4: With the network converted to RMSNorm (see Figure 3), we apply the computational-invariance idea. The input weight matrices $\textrm{diag}(\boldsymbol\alpha){\mathbf{W}}_\textrm{in}$ are pre-multiplied by ${\mathbf{Q}}^\top$. The output matrices ${\mathbf{W}}_\textrm{out}{\mathbf{M}}$ are post-multiplied by ${\mathbf{Q}}$. In the skip-connection, a new linear layer ${\mathbf{Q}}_{\ell}^\top{\mathbf{Q}}_{\ell+1}$ is added. After these modifications, the matrices can be sliced (hatched areas).
  • Figure 5: Mean zero-shot accuracy on OPT, Llama-2 and Phi-2 across multiple tasks after slicing with the WikiText-2 (top) and Alpaca (bottom) datasets for calibration.
  • ...and 3 more figures
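
The LayerNorm-to-RMSNorm conversion described in the Figure 3 caption can be checked numerically. The sketch below is a toy illustration, not the authors' code: it assumes RMSNorm means division by the row norm (which is what makes the $\sqrt{D}$ constant appear), applies the mean-subtraction matrix `M` directly to the signal rather than folding it into a preceding ${\mathbf{W}}_\textrm{out}$, and folds the LayerNorm shift `beta` into the input bias, a detail the figures omit since they do not show biases.

```python
import numpy as np

rng = np.random.default_rng(1)
D, H, N = 6, 10, 3  # embedding dim, hidden dim, tokens (toy sizes)

def layernorm(x, alpha, beta):
    # Standard LayerNorm without epsilon: centre, scale to unit variance, affine.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var) * alpha + beta

def rmsnorm(x):
    # Assumed convention: divide each row by its Euclidean norm.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

X = rng.standard_normal((N, D))
alpha, beta = rng.standard_normal(D), rng.standard_normal(D)
W_in, b_in = rng.standard_normal((D, H)), rng.standard_normal(H)

# Mean-subtraction as a matrix: x @ M removes the mean of each row.
M = np.eye(D) - np.ones((D, D)) / D

# Original path: LayerNorm followed by the input linear layer.
ref = layernorm(X, alpha, beta) @ W_in + b_in

# Converted path: diag(alpha) and the sqrt(D) constant are absorbed into W_in,
# and beta is folded into the bias (an assumption of this sketch).
W_in_fused = (np.sqrt(D) * np.diag(alpha)) @ W_in
b_in_fused = beta @ W_in + b_in
fused = rmsnorm(X @ M) @ W_in_fused + b_in_fused

print(np.max(np.abs(ref - fused)))  # differs only by floating-point error
```

In the figure, `M` is applied to the output matrices of the preceding block, so the signal arriving at the normalization is already mean-centred and only the RMSNorm operation remains between blocks.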

Theorems & Definitions (2)

  • Theorem 1
  • proof