NoWag: A Unified Framework for Shape Preserving Compression of Large Language Models

Lawrence Liu, Inesh Chakrabarti, Yixiao Li, Mengdi Wang, Tuo Zhao, Lin F. Yang

TL;DR

NoWag introduces a unified, weight- and activation-guided framework for shape-preserving compression of large language models. By normalizing weight matrices and optimizing a weighted, activation-aware loss, it unifies vector quantization (NoWag-VQ) and pruning (NoWag-P) under a single objective, achieving strong one-shot performance. Empirical results show NoWag-VQ surpasses state-of-the-art one-shot VQ methods with far less calibration data, while NoWag-P outperforms leading pruning baselines across multiple model sizes, pattern types, and datasets. This normalization-centric approach reduces sensitivity to outlier weights, accelerates compression, and enables efficient deployment of large models with substantial memory savings and speedups.

Abstract

Large language models (LLMs) exhibit remarkable performance across various natural language processing tasks but suffer from immense computational and memory demands, limiting their deployment in resource-constrained environments. To address this challenge, we propose NoWag (Normalized Weight and Activation Guided Compression), a unified framework for one-shot shape-preserving compression algorithms. We apply NoWag to compress Llama-2 (7B, 13B, 70B) and Llama-3 (8B, 70B) models using two popular shape-preserving techniques: vector quantization (NoWag-VQ) and unstructured/semi-structured pruning (NoWag-P). Our results show that NoWag-VQ significantly outperforms state-of-the-art one-shot vector quantization methods, while NoWag-P performs competitively against leading pruning techniques. These findings highlight underlying commonalities between these compression paradigms and suggest promising directions for future research. Our code is available at https://github.com/LawrenceRLiu/NoWag

Paper Structure

This paper contains 27 sections, 6 equations, 5 figures, 17 tables, 6 algorithms.

Figures (5)

  • Figure 1: Illustration of the proposed NoWag (Normalized Weight and Activation Guided Compression) framework. Given an LLM, each weight matrix $\mathop{\mathrm{W}}\nolimits$ is compressed independently. Vectors $r^{\left(1\right)}$ and $r^{\left(2\right)}$ are used to normalize $\mathop{\mathrm{W}}\nolimits$, and the second moment of the activations, $\text{diag}\left(XX^T\right)$, guides the importance of each weight for the compression algorithm, such as K-means VQ (NoWag-VQ) or pruning (NoWag-P).
  • Figure 2: A sample weight from the first attention layer of Llama-2-7B. From left to right: visualization of the absolute values of the weights, normalized weights, importance scores, and normalized importance scores all down-sampled to 1:4 scale by max pooling. Individual elements are visualized in log scale, with blue implying larger value.
  • Figure 3: 2D PCA visualization of the distribution of $d=6$ grouped entries from $\mathop{\mathrm{W}}\nolimits$ and $\mathop{\mathrm{\bar{W}}}\nolimits$. Densities are plotted in log scale. Normalization reshapes the distribution into a more "ball-shaped" distribution.
  • Figure 4: Relative difference in C4 perplexity between NoWag-P and Wanda: $(\text{NoWag Perplexity})/(\text{Wanda Perplexity})-1$, calculated for a range of semi-structured patterns for Llama-2-13B and Llama-3-8B.
  • Figure : NoWag Normalization
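
The pipeline described in the Figure 1 caption (normalize $W$ with two vectors $r^{(1)}$ and $r^{(2)}$, then weight each entry by $\text{diag}(XX^T)$) can be sketched in NumPy. This is only one plausible reading of the caption, not the authors' implementation: the column-then-row order of normalization, the use of L2 norms, the `eps` guard, the magnitude-style pruning rule, and every function name below are assumptions made for illustration. The repository linked in the abstract is the authoritative reference.

```python
import numpy as np


def normalize(W, eps=1e-12):
    """Assumed two-step normalization: column L2 norms, then row L2 norms."""
    r1 = np.linalg.norm(W, axis=0) + eps        # one scale per input column
    W_tilde = W / r1                            # broadcast over rows
    r2 = np.linalg.norm(W_tilde, axis=1) + eps  # one scale per output row
    W_bar = W_tilde / r2[:, None]
    return W_bar, r1, r2


def denormalize(W_bar, r1, r2):
    """Undo normalize(): restore the original scale of each row and column."""
    return W_bar * r2[:, None] * r1


def activation_second_moment(X):
    """diag(X X^T) for calibration activations X of shape (d_in, n_samples)."""
    return np.sum(X * X, axis=1)


def weighted_loss(W_bar, W_bar_hat, x_sq):
    """Squared error on normalized weights, weighted per input channel."""
    return float(np.sum((W_bar - W_bar_hat) ** 2 * x_sq))


def prune_unstructured(W_bar, x_sq, sparsity=0.5):
    """Zero the entries with the smallest assumed score W_bar^2 * diag(XX^T)."""
    scores = W_bar ** 2 * x_sq
    k = int(round(sparsity * W_bar.size))
    if k == 0:
        return W_bar.copy()
    threshold = np.partition(scores.ravel(), k - 1)[k - 1]  # k-th smallest score
    return W_bar * (scores > threshold)


rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))     # toy weight matrix (d_out, d_in)
X = rng.normal(size=(16, 32))    # toy calibration activations (d_in, n_samples)

W_bar, r1, r2 = normalize(W)
x_sq = activation_second_moment(X)
W_bar_pruned = prune_unstructured(W_bar, x_sq, sparsity=0.5)
W_pruned = denormalize(W_bar_pruned, r1, r2)   # same shape as W
print(weighted_loss(W_bar, W_bar_pruned, x_sq))
```

Because the loss is a sum of independent per-entry terms, zeroing the entries with the smallest scores minimizes it for a fixed number of pruned weights. A vector-quantization variant would reuse the same weights `x_sq` inside a weighted K-means step instead of a mask.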