Neural Weight Compression for Language Models
Jegwang Ryu, Minkyu Kim, Seungjun Shin, Hee Min Choi, Dokwan Oh, Jaeho Lee
TL;DR
This work tackles the challenge of storing and transmitting the weights of large language models by introducing Neural Weight Compression (NWC), a learned autoencoder-based neural codec trained directly on pretrained weight tensors. NWC handles diverse weight shapes via column-wise chunking, uses an importance-aware loss to allocate precision where it matters, and employs inference-time error compensation to limit the effect of reconstruction error on downstream model predictions. The approach yields competitive or state-of-the-art rate–distortion performance, particularly at mid-to-high bitrates of around $4$–$6$ bits, and generalizes beyond language models to vision encoders. While hardware support for a neural decoder remains a bottleneck for practical speedups, the framework offers a promising data-driven pathway for scalable, cross-domain weight compression and opens opportunities for hardware co-design.
Abstract
The efficient storage and transmission of language model weights is becoming increasingly important as their scale and adoption continue to grow. However, because our understanding of this new data modality is limited, designing a good compression algorithm for language model weights relies heavily on manual, trial-and-error approaches. In this paper, we propose a learned compression framework that trains neural codecs directly on pretrained language model weights. Unlike conventional data (e.g., images), language model weights pose unique challenges: the sizes and shapes of weight tensors vary significantly, and the reconstruction quality must be judged by downstream model predictions rather than by a naïve MSE loss. To address these challenges, we introduce Neural Weight Compression (NWC), a novel autoencoder-based neural codec tailored to model weight compression. The proposed method inherits the advantages of autoencoder-based codecs while incorporating three technical components: (1) column-wise tensor chunking and normalization; (2) an importance-aware training loss; and (3) an inference-time error compensation mechanism guided by model outputs. Experiments on open-weight language models show that NWC achieves competitive or state-of-the-art accuracy-compression tradeoffs, with particularly strong results at 4–6 bits, where accuracy remains nearly on par with FP16 models.
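
To make components (1) and (2) concrete, the following is a minimal NumPy sketch, not the authors' implementation. The abstract does not specify the chunk length, the normalizer, or how importance is computed, so everything here is an assumption chosen for illustration: the function names (`chunk_columns`, `unchunk_columns`, `importance_weighted_mse`), the use of per-column standard deviation for normalization, zero-padding of the last chunk, and the per-element importance weights.

```python
import numpy as np

def chunk_columns(W, chunk_len):
    """Normalize each column of W, then cut every column into fixed-length chunks.

    Assumption: the normalizer is the per-column standard deviation; columns
    whose length is not a multiple of chunk_len are zero-padded at the end.
    """
    rows, cols = W.shape
    scale = W.std(axis=0, keepdims=True) + 1e-8   # shape (1, cols)
    Wn = W / scale
    pad = (-rows) % chunk_len                     # rows needed to reach a multiple
    Wn = np.pad(Wn, ((0, pad), (0, 0)))
    # Transpose so each column is contiguous, then split into chunks.
    chunks = Wn.T.reshape(-1, chunk_len)          # (cols * chunks_per_col, chunk_len)
    return chunks, scale, rows

def unchunk_columns(chunks, scale, rows):
    """Inverse of chunk_columns: reassemble columns, drop padding, undo scaling."""
    cols = scale.shape[1]
    Wn = chunks.reshape(cols, -1).T[:rows]
    return Wn * scale

def importance_weighted_mse(x, x_hat, importance):
    """Squared error weighted by a hypothetical per-element importance score."""
    return float(np.sum(importance * (x - x_hat) ** 2) / np.sum(importance))

# Usage: tensors of different shapes all map to chunks of the same length,
# so a single fixed-input-size codec could process them.
rng = np.random.default_rng(0)
for shape in [(10, 3), (64, 16)]:
    W = rng.standard_normal(shape)
    chunks, scale, rows = chunk_columns(W, chunk_len=4)
    W_rec = unchunk_columns(chunks, scale, rows)
    assert np.allclose(W_rec, W)
```

In a real codec, the chunks would pass through a learned encoder, an entropy-coded bottleneck, and a decoder before `unchunk_columns`; the weighted loss would then compare original and decoded chunks, with larger weights on entries judged more important to model outputs.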
