Huff-LLM: End-to-End Lossless Compression for Efficient LLM Inference

Patrick Yubeaton, Tareq Mahmoud, Shehab Naga, Pooria Taheri, Tianhua Xia, Arun George, Yasmein Khalil, Sai Qian Zhang, Siddharth Joshi, Chinmay Hegde, Siddharth Garg

TL;DR

Huff-LLM introduces an end-to-end lossless compression scheme for LLM weights that enables storing and operating on compressed weights across cloud, disk, main memory, and on-chip buffers. By splitting FP16 weights into small, Huffman-encoded subsets and inserting lightweight decoders into standard hardware accelerators (systolic arrays and Simba-like vector units), the approach decompresses weights on-the-fly with minimal throughput impact. Across multiple LLM families (3B–13B) and formats (FP16/BF16), Huff-LLM achieves up to 32% model-size reduction, up to 31% lower latency, and up to 26% energy savings, with area overhead around 6%. This lossless, hardware-friendly design preserves model behavior while delivering practical gains in memory bandwidth, on-chip storage, and inference efficiency, making larger models more feasible on edge devices.

Abstract

As large language models (LLMs) become more capable, they have continued to increase rapidly in size. This has made it harder to run state-of-the-art LLMs on small edge devices. Standard techniques address this problem through compression methods such as quantization or pruning. However, these methods are lossy and have been shown to change model behavior in unpredictable ways. We propose Huff-LLM, an end-to-end, lossless model compression method that lets users store LLM weights in compressed format everywhere: cloud, disk, main memory, and even on-chip memory and buffers. This not only allows larger models to be loaded in main memory, but also reduces the bandwidth required to load weights on chip and makes more efficient use of on-chip weight buffers. In addition to the memory savings achieved via compression, we also show latency and energy efficiency improvements when performing inference with the compressed model.

Paper Structure

This paper contains 29 sections, 5 equations, 6 figures, 13 tables.

Figures (6)

  • Figure 1: Systolic array diagram with a detailed look at the PE. Weights and activations are sent to the PEs at every clock cycle. Bubbles indicate the delays needed to keep the computations of the output-stationary architecture correct. Colors denote operands: weights are yellow, inputs are light blue, and partial sums/outputs are pink.
  • Figure 3: Area overhead of a CAM lookup for a single-cycle N-bit Huffman decoder, normalized to a column of 128 FP16 multipliers, both clocked at 1 GHz. The area overhead of Huffman decoding grows quickly with N, leaving only N={4,5} as viable options.
  • Figure 4: Our Huffman compression method follows these steps for every parameter. It breaks an FP16 number into 4 groups of bits. The sign bit remains uncompressed. The exponent and mantissa bits are passed through a Huffman coder to be compressed. The compressed bits are then stored in memory until they are needed for inference.
  • Figure 5: Systolic array and Huffman Decoder integration.
  • Figure 6: (a) Roofline plot of Systolic Arrays with 128GB/s and 256GB/s DRAM bandwidth. Dashed lines show the baseline and Huff-LLM models, with intersections marking operational points. (b) Energy breakdown of Huff-LLM and baseline model on the Systolic Array with 256GB/s DRAM bandwidth.
  • ...and 1 more figure
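
The compression flow in the Figure 4 caption (split each FP16 weight into bit groups, leave the sign bit raw, Huffman-code the rest) can be illustrated with a small software sketch. This is not the authors' implementation. The exact group boundaries are an assumption here (1 sign bit, 5 exponent bits, and the 10 mantissa bits split into two 5-bit halves), and every function name is hypothetical. The sketch models only the lossless round trip and the payload size; it does not model the hardware CAM decoder or the cost of storing the code tables.

```python
import heapq
from collections import Counter
from itertools import count

import numpy as np


def split_fp16(weights):
    """Split FP16 values into 4 bit groups (assumed layout: 1 + 5 + 5 + 5 bits)."""
    bits = np.asarray(weights, dtype=np.float16).ravel().view(np.uint16)
    sign = (bits >> 15) & 0x1
    exponent = (bits >> 10) & 0x1F
    mant_hi = (bits >> 5) & 0x1F   # upper half of the mantissa (assumed split)
    mant_lo = bits & 0x1F          # lower half of the mantissa (assumed split)
    return sign, exponent, mant_hi, mant_lo


def merge_fp16(sign, exponent, mant_hi, mant_lo):
    """Reassemble the 4 bit groups into FP16 values."""
    s, e, hi, lo = (np.asarray(x, dtype=np.uint16)
                    for x in (sign, exponent, mant_hi, mant_lo))
    bits = (s << 15) | (e << 10) | (hi << 5) | lo
    return bits.astype(np.uint16).view(np.float16)


def build_huffman_code(symbols):
    """Return a {symbol: bitstring} prefix code built from symbol frequencies."""
    freq = Counter(int(s) for s in symbols)
    if len(freq) == 1:  # degenerate case: a single distinct symbol
        return {next(iter(freq)): "0"}
    tiebreak = count()  # unique counter so the heap never compares dicts
    heap = [(f, next(tiebreak), {sym: ""}) for sym, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f0, _, c0 = heapq.heappop(heap)
        f1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c0.items()}
        merged.update({s: "1" + c for s, c in c1.items()})
        heapq.heappush(heap, (f0 + f1, next(tiebreak), merged))
    return heap[0][2]


def encode(symbols, code):
    """Concatenate the codewords for a sequence of symbols."""
    return "".join(code[int(s)] for s in symbols)


def decode(bitstring, code):
    """Greedy prefix decoding; valid because Huffman codes are prefix-free."""
    inverse = {c: s for s, c in code.items()}
    out, current = [], ""
    for bit in bitstring:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return out


def roundtrip(weights):
    """Compress and decompress; return restored weights and payload size in bits."""
    sign, *fields = split_fp16(weights)
    total_bits = sign.size  # the sign bit is stored uncompressed
    decoded = []
    for field in fields:
        code = build_huffman_code(field)
        stream = encode(field, code)
        total_bits += len(stream)
        decoded.append(decode(stream, code))
    return merge_fp16(sign, *decoded), total_bits


# Synthetic, roughly weight-like values (an assumption, not real LLM weights).
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, 4096).astype(np.float16)
restored, total_bits = roundtrip(w)
print("bit-exact:", np.array_equal(restored.view(np.uint16), w.view(np.uint16)))
print("bits per weight:", total_bits / w.size)
```

Keeping each group at 5 bits or fewer mirrors the Figure 3 observation that only N={4,5} decoders are viable in area. In a sketch like this, most of the saving would be expected to come from the exponent group, because mantissa bits tend to be close to uniformly distributed.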