PocketLLM: Ultimate Compression of Large Language Models via Meta Networks
Ye Tian, Chengcheng Wang, Jing Han, Yehui Tang, Kai Han
TL;DR
PocketLLM addresses the difficulty of storing and transmitting ever-larger LLMs on edge devices by compressing their weights in a latent space, using a meta encoder, a meta decoder, and a compact codebook. The meta encoder maps weight subvectors to latent vectors, each latent vector is replaced by the index of a codebook entry, and the meta decoder reconstructs approximate weights from those entries, so only the decoder, the codebook, and the indices need to be stored. The reported results show state-of-the-art performance at extreme compression ratios (e.g., up to $20\times$), with optional fine-tuning restoring dense-model accuracy. Because only this compact representation is stored or transmitted, the method suits edge devices and low-bandwidth environments while maintaining strong zero-shot performance across multiple benchmarks. The work also introduces Reshaped Layer Normalization and uses a straight-through estimator to pass gradients through the non-differentiable codebook lookup.
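The encode, index, and decode steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes single linear maps for the encoder and decoder, a nearest-neighbor codebook lookup under squared Euclidean distance, and randomly initialized (untrained) parameters. All names and sizes (`compress`, `decompress`, `subvec_len`, `latent_dim`, `n_codes`) are hypothetical, and training, Reshaped Layer Normalization, and the straight-through estimator are omitted.

```python
import numpy as np

def compress(weight, encoder, codebook, subvec_len):
    """Return one codebook index per weight subvector (assumed nearest-neighbor lookup)."""
    subvectors = weight.reshape(-1, subvec_len)        # (n_sub, subvec_len)
    latents = subvectors @ encoder                     # (n_sub, latent_dim)
    # Squared Euclidean distance from every latent to every codebook entry.
    diffs = latents[:, None, :] - codebook[None, :, :]
    distances = (diffs ** 2).sum(axis=-1)              # (n_sub, n_codes)
    return distances.argmin(axis=1)

def decompress(indices, codebook, decoder, shape):
    """Rebuild an approximate weight matrix from indices, codebook, and decoder."""
    latents = codebook[indices]                        # (n_sub, latent_dim)
    return (latents @ decoder).reshape(shape)

# Hypothetical sizes, chosen only to keep the example small.
rng = np.random.default_rng(0)
subvec_len, latent_dim, n_codes = 8, 4, 16
weight = rng.standard_normal((32, 64))
encoder = rng.standard_normal((subvec_len, latent_dim))   # untrained stand-in
decoder = rng.standard_normal((latent_dim, subvec_len))   # untrained stand-in
codebook = rng.standard_normal((n_codes, latent_dim))     # untrained stand-in

indices = compress(weight, encoder, codebook, subvec_len)
restored = decompress(indices, codebook, decoder, weight.shape)
print(indices.shape, restored.shape)
```

Note that `decompress` never touches the encoder, which matches the statement that only the decoder, codebook, and indices are stored. With untrained parameters the reconstruction is of course poor; the sketch shows the data flow only.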
Abstract
As Large Language Models (LLMs) continue to grow in size, storing and transmitting them on edge devices becomes increasingly challenging. Traditional methods like quantization and pruning struggle to achieve extreme compression of LLMs without sacrificing accuracy. In this paper, we introduce PocketLLM, a novel approach to compress LLMs in a latent space via meta-networks. A simple encoder network is proposed to project the weights of LLMs into discrete latent vectors, which are then represented using a compact codebook. A lightweight decoder network is employed to map the codebook's representative vectors back to the original weight space. This method allows for significant compression of the large weights in LLMs: the compressed representation consists solely of a small decoder, a concise codebook, and an index. Extensive experiments show that PocketLLM achieves superior performance even at significantly high compression ratios, e.g., compressing Llama 2-7B by 10x with a negligible drop in accuracy.
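To see how storing only a decoder, a codebook, and an index can produce ratios of this order, the storage cost can be tallied directly. The sketch below is a back-of-the-envelope calculation under assumed settings, not figures from the paper: the layer size, subvector length, codebook size, latent dimension, decoder parameter count, and 16-bit baseline precision are all hypothetical.

```python
def compression_ratio(n_weights, subvec_len, n_codes, latent_dim,
                      decoder_params, bits_per_param=16):
    """Ratio of dense storage to (indices + codebook + decoder) storage, in bits."""
    original_bits = n_weights * bits_per_param
    n_subvectors = n_weights // subvec_len
    bits_per_index = (n_codes - 1).bit_length()    # bits needed to address n_codes entries
    index_bits = n_subvectors * bits_per_index
    codebook_bits = n_codes * latent_dim * bits_per_param
    decoder_bits = decoder_params * bits_per_param
    return original_bits / (index_bits + codebook_bits + decoder_bits)

# Hypothetical example: one 4096 x 4096 layer stored in 16-bit precision.
ratio = compression_ratio(
    n_weights=4096 * 4096,
    subvec_len=8,         # assumed subvector length
    n_codes=4096,         # assumed codebook size (12-bit indices)
    latent_dim=8,         # assumed latent dimension
    decoder_params=1024,  # assumed decoder size
)
print(f"compression ratio: {ratio:.2f}x")
```

Under these assumptions the indices dominate the cost, so the ratio is close to `bits_per_param * subvec_len / bits_per_index`; the codebook and decoder add only a small overhead when they are shared across many subvectors.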
