Huff-LLM: End-to-End Lossless Compression for Efficient LLM Inference
Patrick Yubeaton, Tareq Mahmoud, Shehab Naga, Pooria Taheri, Tianhua Xia, Arun George, Yasmein Khalil, Sai Qian Zhang, Siddharth Joshi, Chinmay Hegde, Siddharth Garg
TL;DR
Huff-LLM introduces an end-to-end lossless compression scheme for LLM weights that enables storing and operating on compressed weights across cloud, disk, main memory, and on-chip buffers. By splitting FP16 weights into small, Huffman-encoded subsets and inserting lightweight decoders into standard hardware accelerators (systolic arrays and Simba-like vector units), the approach decompresses weights on-the-fly with minimal throughput impact. Across multiple LLM families (3B–13B) and formats (FP16/BF16), Huff-LLM achieves up to 32% model-size reduction, up to 31% lower latency, and up to 26% energy savings, with area overhead around 6%. This lossless, hardware-friendly design preserves model behavior while delivering practical gains in memory bandwidth, on-chip storage, and inference efficiency, making larger models more feasible on edge devices.
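To make the "split into small subsets, then Huffman-encode" idea concrete, here is a minimal software sketch, not the paper's implementation. It assumes each 16-bit weight is cut into four 4-bit fields; this section does not say how Huff-LLM actually partitions the bits. The weights are synthetic Gaussian samples, and the helper names (`split_fields`, `build_huffman_code`, and so on) are hypothetical. The sketch only shows that per-field Huffman coding round-trips losslessly. It does not model the hardware decoders.

```python
import heapq
import random
import struct
from collections import Counter


def fp16_bits(x):
    """Return the 16-bit IEEE 754 half-precision pattern of x as an int."""
    return struct.unpack("<H", struct.pack("<e", x))[0]


def split_fields(word):
    # Assumed split (not from the paper): four 4-bit fields, MSB first.
    return [(word >> shift) & 0xF for shift in (12, 8, 4, 0)]


def join_fields(fields):
    """Inverse of split_fields: reassemble the 16-bit word."""
    word = 0
    for f in fields:
        word = (word << 4) | f
    return word


def build_huffman_code(symbols):
    """Build a {symbol: bitstring} prefix code from symbol frequencies."""
    freq = Counter(symbols)
    if len(freq) == 1:
        # A single distinct symbol still needs a one-bit codeword.
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: partial code}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w0, _, c0 = heapq.heappop(heap)
        w1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c0.items()}
        merged.update({s: "1" + c for s, c in c1.items()})
        heapq.heappush(heap, (w0 + w1, next_id, merged))
        next_id += 1
    return heap[0][2]


def encode(symbols, code):
    return "".join(code[s] for s in symbols)


def decode(bits, code):
    """Walk the bitstring, emitting a symbol whenever a codeword matches."""
    inverse = {c: s for s, c in code.items()}
    out, cur = [], ""
    for b in bits:
        cur += b
        if cur in inverse:
            out.append(inverse[cur])
            cur = ""
    return out


def compress(words):
    # One symbol stream per field position, each with its own Huffman code.
    streams = list(zip(*(split_fields(w) for w in words)))
    codes = [build_huffman_code(s) for s in streams]
    payloads = [encode(s, c) for s, c in zip(streams, codes)]
    return codes, payloads


def decompress(codes, payloads):
    streams = [decode(p, c) for p, c in zip(payloads, codes)]
    return [join_fields(f) for f in zip(*streams)]


random.seed(0)
# Synthetic stand-in for FP16 weights; real LLM weights will differ.
words = [fp16_bits(random.gauss(0.0, 0.02)) for _ in range(4096)]
codes, payloads = compress(words)
assert decompress(codes, payloads) == words  # bit-exact round trip

compressed_bits = sum(len(p) for p in payloads)
print("compressed / original size:", compressed_bits / (16 * len(words)))
```

Coding small fields separately keeps each code table at 16 entries or fewer, which is the kind of property that makes a decoder cheap to build. The printed ratio covers payload bits only, ignores the cost of storing the code tables, and says nothing about the ratios reported for real models.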
Abstract
As large language models (LLMs) become more capable, they continue to grow rapidly in size. This makes it increasingly difficult to run state-of-the-art LLMs on small edge devices. The standard remedy is to compress the model through quantization or pruning. However, these techniques are lossy and have been shown to change model behavior in unpredictable ways. We propose Huff-LLM, an \emph{end-to-end, lossless} model compression method that lets users store LLM weights in compressed format \emph{everywhere}: in the cloud, on disk, in main memory, and even in on-chip memory and buffers. This not only allows larger models to fit in main memory, but also reduces the bandwidth required to load weights on chip and makes more efficient use of on-chip weight buffers. Beyond the memory savings from compression, we also show latency and energy-efficiency improvements when performing inference with the compressed model.
