Neural Weight Compression for Language Models

Jegwang Ryu, Minkyu Kim, Seungjun Shin, Hee Min Choi, Dokwan Oh, Jaeho Lee

TL;DR

This work tackles the challenge of storing and transmitting the weights of large language models by introducing Neural Weight Compression (NWC), a learned autoencoder-based neural codec trained directly on pretrained weight tensors. NWC handles diverse weight shapes via column-wise chunking, uses an importance-aware loss to allocate precision where it matters, and employs inference-time error compensation to mitigate reconstruction damage on downstream model predictions. The approach yields competitive or state-of-the-art rate–distortion performance, particularly at mid-to-high bitrates around $4$–$6$ bits, and generalizes to vision encoders beyond language models. While hardware support for a neural decoder remains a bottleneck for practical speedups, the framework offers a promising data-driven pathway for scalable, cross-domain weight compression and potential hardware co-design opportunities.

Abstract

The efficient storage and transmission of language model weights is becoming increasingly important, as their scale and adoption continue to grow. However, as our understanding of this new data modality is limited, designing a good compression algorithm for language model weights heavily relies on manual, trial-and-error approaches. In this paper, we propose a learned compression framework that trains neural codecs directly from pretrained language model weights. Unlike conventional data (e.g., images), language model weights pose unique challenges: the sizes and shapes of weight tensors vary significantly, and the reconstruction quality must be judged by downstream model predictions rather than naïve MSE loss. To address this, we introduce Neural Weight Compression (NWC), a novel autoencoder-based neural codec tailored to model weight compression. The proposed method inherits the advantages of autoencoder-based codecs while incorporating three technical components: (1) column-wise tensor chunking and normalization; (2) an importance-aware training loss; (3) an inference-time error compensation mechanism guided by model outputs. Experiments on open-weight language models show that NWC achieves competitive or state-of-the-art accuracy-compression tradeoffs, with particularly strong results at 4-6 bit precisions where accuracy remains nearly on par with FP16 models.
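
As a rough illustration of component (1), the sketch below normalizes each column of a weight matrix and splits it into fixed-length chunks, so that tensors of different shapes become a uniform batch of vectors a codec could consume. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `chunk_columns` and `unchunk_columns`, the chunk length of 16, and the use of the per-column standard deviation as the normalization scale are all hypothetical choices that the abstract does not specify.

```python
import numpy as np


def chunk_columns(weight, chunk_len=16):
    """Scale each column, then split every column into fixed-length chunks.

    Assumptions (not from the paper): chunk_len=16 and per-column
    standard deviation as the normalization scale.
    """
    rows, cols = weight.shape
    if rows % chunk_len != 0:
        raise ValueError("this sketch requires rows divisible by chunk_len")
    # One scale per column, kept so the transform can be inverted later.
    scales = weight.std(axis=0, keepdims=True) + 1e-8  # shape (1, cols)
    normalized = weight / scales
    # After the transpose, each row holds one original column; the reshape
    # cuts every column into rows // chunk_len consecutive chunks.
    chunks = normalized.T.reshape(cols * (rows // chunk_len), chunk_len)
    return chunks, scales


def unchunk_columns(chunks, scales, shape):
    """Invert chunk_columns: reassemble the columns and undo the scaling."""
    rows, cols = shape
    normalized = chunks.reshape(cols, rows).T
    return normalized * scales


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(32, 5))
    chunks, scales = chunk_columns(w, chunk_len=16)
    restored = unchunk_columns(chunks, scales, w.shape)
    print(chunks.shape, np.allclose(restored, w))
```

In this sketch the chunks from every layer share one fixed length regardless of the original tensor shape, which is the property that would let a single codec be trained across layers; the per-column scales would have to be stored alongside the compressed chunks.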

Paper Structure

This paper contains 27 sections, 6 equations, 11 figures, 2 tables.

Figures (11)

  • Figure 1: Operational diagrams of various weight compression paradigms and their corresponding minimization objectives. Learnable paths are marked in blue. $g, g^{-1}$: the transform and its inverse; $\cdot$: random matrix generation and multiplication; $U/Q$: uniform noise / quantization; $\operatorname{Search}$: grid search of codewords and coefficients; $g_a, g_s$: trainable analysis and synthesis networks.
  • Figure 2: Rate-distortion curves of codecs assuming heavy-tailed and Gaussian distributions. (a) Using a codec optimized for Gaussian data. (b) Comparison of different codecs on Laplacian data. (c) Comparison of neural codecs trained on synthetic Gaussian data vs. on actual model weights.
  • Figure 3: A visual description of the proposed neural weight compression (NWC) framework. (Left) Preprocessing steps for the weight tensors, including column-wise chunking and normalization. (Right) Model architectures of the analysis and synthesis networks.
  • Figure 4: Compression results for Llama 3-8B across various bitrates. (a) WikiText-2 perplexity. (b) Zero-shot accuracy on the MMLU benchmark.
  • Figure 5: C4 perplexities of various Llama models.
  • ...and 6 more figures