NoWag: A Unified Framework for Shape Preserving Compression of Large Language Models
Lawrence Liu, Inesh Chakrabarti, Yixiao Li, Mengdi Wang, Tuo Zhao, Lin F. Yang
TL;DR
NoWag introduces a unified, weight- and activation-guided framework for shape-preserving compression of large language models. By normalizing weight matrices and optimizing a weighted, activation-aware loss, it unifies vector quantization (NoWag-VQ) and pruning (NoWag-P) under a single objective, achieving strong one-shot performance. Empirical results show NoWag-VQ surpasses state-of-the-art one-shot VQ methods with far less calibration data, while NoWag-P outperforms leading pruning baselines across multiple model sizes, pattern types, and datasets. This normalization-centric approach reduces sensitivity to outlier weights, accelerates compression, and enables efficient deployment of large models with substantial memory savings and speedups.
Abstract
Large language models (LLMs) exhibit remarkable performance across various natural language processing tasks but suffer from immense computational and memory demands, limiting their deployment in resource-constrained environments. To address this challenge, we propose NoWag (Normalized Weight and Activation Guided Compression), a unified framework for one-shot shape-preserving compression algorithms. We apply NoWag to compress Llama-2 (7B, 13B, 70B) and Llama-3 (8B, 70B) models using two popular shape-preserving techniques: vector quantization (NoWag-VQ) and unstructured/semi-structured pruning (NoWag-P). Our results show that NoWag-VQ significantly outperforms state-of-the-art one-shot vector quantization methods, while NoWag-P performs competitively against leading pruning techniques. These findings highlight underlying commonalities between these compression paradigms and suggest promising directions for future research. Our code is available at https://github.com/LawrenceRLiu/NoWag
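
To make the idea of normalized, activation-guided, shape-preserving compression concrete, the following is a minimal sketch of one-shot unstructured pruning. It is an illustration under stated assumptions, not the implementation from the NoWag repository: the normalization scheme (unit-norm columns, then unit-norm rows), the importance score (squared normalized weight times the squared norm of the matching input activation), and the names `normalize`, `prune`, and `act_sq_norms` are all hypothetical choices made for this example.

```python
import math

def normalize(W):
    """Scale columns, then rows, to unit L2 norm (assumed scheme)."""
    rows, cols = len(W), len(W[0])
    # Column norms of the raw matrix; fall back to 1.0 for all-zero columns.
    col_norms = [
        math.sqrt(sum(W[i][j] ** 2 for i in range(rows))) or 1.0
        for j in range(cols)
    ]
    W_c = [[W[i][j] / col_norms[j] for j in range(cols)] for i in range(rows)]
    # Row norms of the column-normalized matrix.
    row_norms = [math.sqrt(sum(v * v for v in row)) or 1.0 for row in W_c]
    return [[v / row_norms[i] for v in row] for i, row in enumerate(W_c)]

def prune(W, act_sq_norms, sparsity=0.5):
    """Zero the lowest-scoring entries; the output keeps the shape of W.

    act_sq_norms[j] is the squared L2 norm of input feature j over a
    calibration set (hypothetical input for this sketch).
    """
    W_bar = normalize(W)
    # Assumed importance score: normalized weight^2 * activation energy.
    scores = sorted(
        (W_bar[i][j] ** 2 * act_sq_norms[j], i, j)
        for i in range(len(W))
        for j in range(len(W[0]))
    )
    n_drop = int(len(scores) * sparsity)
    pruned = [row[:] for row in W]  # copy so the input is not mutated
    for _, i, j in scores[:n_drop]:
        pruned[i][j] = 0.0
    return pruned

W = [[4.0, 0.1], [0.2, 3.0]]
pruned = prune(W, act_sq_norms=[1.0, 1.0], sparsity=0.5)
print(pruned)
```

Because entries are scored on the normalized matrix rather than on raw magnitudes, a single large row or column has less influence on which weights survive. The pruned matrix has the same dimensions as the original, which is what "shape-preserving" refers to here.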
