SeedLM: Compressing LLM Weights into Seeds of Pseudo-Random Generators

Rasoul Shafipour, David Harrison, Maxwell Horton, Jeffrey Marker, Houman Bedayat, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi, Saman Naderiparizi

TL;DR

SeedLM is a novel post-training compression method that uses seeds of pseudo-random generators to encode and compress model weights, which reduces memory access and leverages idle compute cycles during inference, effectively speeding up memory-bound tasks by trading compute for fewer memory accesses.

Abstract

Large Language Models (LLMs) have transformed natural language processing, but face significant challenges in widespread deployment due to their high runtime cost. In this paper, we introduce SeedLM, a novel post-training compression method that uses seeds of pseudo-random generators to encode and compress model weights. Specifically, for each block of weights, we find a seed that is fed into a Linear Feedback Shift Register (LFSR) during inference to efficiently generate a random matrix. This matrix is then linearly combined with compressed coefficients to reconstruct the weight block. SeedLM reduces memory access and leverages idle compute cycles during inference, effectively speeding up memory-bound tasks by trading compute for fewer memory accesses. Unlike state-of-the-art compression methods that rely on calibration data, our approach is data-free and generalizes well across diverse tasks. Our experiments with Llama 3 70B, which is particularly challenging to compress, show that SeedLM achieves significantly better zero-shot accuracy retention at 4- and 3-bit than state-of-the-art techniques, while maintaining performance comparable to FP16 baselines. Additionally, FPGA-based tests demonstrate that 4-bit SeedLM, as model size increases to 70B, approaches a 4x speed-up over an FP16 Llama 2/3 baseline.
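The seed-plus-coefficients idea in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the 8-bit register width, the tap positions, the mapping of LFSR states to values in roughly [-1, 1), the block and coefficient sizes, the brute-force seed search, and the function names. The coefficients are also left in floating point, whereas the abstract describes them as compressed.

```python
import numpy as np

def lfsr_sequence(seed, n, k=8, taps=(8, 6, 5, 4)):
    """Return n successive states of a k-bit LFSR (tap positions are an assumption)."""
    mask = (1 << k) - 1
    state, out = seed, []
    for _ in range(n):
        fb = 0
        for t in taps:                      # XOR the tapped bits to form the feedback bit
            fb ^= (state >> (t - 1)) & 1
        state = ((state << 1) | fb) & mask  # shift left, insert the feedback bit
        out.append(state)
    return np.array(out)

def random_matrix(seed, rows, cols, k=8):
    """Pseudo-random matrix regenerated from a seed; states are centred and scaled."""
    states = lfsr_sequence(seed, rows * cols, k)
    return ((states - 2 ** (k - 1)) / 2 ** (k - 1)).reshape(rows, cols)

def compress_block(w, cols=3, k=8):
    """Try every non-zero seed and keep the one whose matrix fits w best (least squares)."""
    best = None
    for seed in range(1, 2 ** k):
        U = random_matrix(seed, w.size, cols, k)
        t, *_ = np.linalg.lstsq(U, w, rcond=None)
        err = np.linalg.norm(U @ t - w)
        if best is None or err < best[0]:
            best = (err, seed, t)
    return best[1], best[2]

def reconstruct_block(seed, t, size, k=8):
    """Rebuild the block at inference time from the stored seed and coefficients."""
    return random_matrix(seed, size, len(t), k) @ t
```

Only the seed and the few coefficients are stored per block. The matrix itself is regenerated on the fly, which is the compute-for-memory trade the abstract describes.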

Paper Structure

This paper contains 13 sections, 6 equations, 4 figures, 6 tables, 1 algorithm.

Figures (4)

  • Figure 1: Retained zero-shot accuracy across a variety of tasks and compression methods, compared to the FP16 Llama 3 70B model. The top row shows data for 4-bit compression, while the bottom row shows data for 3-bit compression. We compare the performance of SeedLM, AWQ, and OmniQuant across the ARC-Easy, ARC-Challenge, HellaSwag, WinoGrande, and BoolQ tasks. While being completely data-free, SeedLM outperforms state-of-the-art weight quantization methods that rely on a calibration dataset.
  • Figure 2: Compression of weights using pseudo-random generated matrices.
  • Figure 3: Illustration of the state sequence for a $K\!=\!3$ LFSR with all possible states with the feedback polynomial defined in Table \ref{appendix:lfsr_coeff}. The matrix $\mathbf{V}(4)$ starts filling with the value generated one cycle after the seed state $s\!=\!4$, which is highlighted with a thick circle.
  • Figure 4: Block diagram of the RTL design.
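As a rough illustration of the kind of state sequence Figure 3 depicts, the sketch below steps a 3-bit LFSR starting from seed state 4 and collects the states produced from the next cycle onward. The tap positions `(3, 2)` and the shift direction are assumptions chosen for illustration. The paper's actual feedback polynomial is given in its appendix table and may produce a different ordering.

```python
def lfsr_states(seed, k=3, taps=(3, 2)):
    """States visited by a k-bit LFSR, starting one cycle after the seed.

    The taps are a hypothetical choice, not the paper's polynomial.
    """
    mask = (1 << k) - 1
    state, states = seed, []
    while True:
        fb = 0
        for t in taps:                      # XOR of the tapped bits
            fb ^= (state >> (t - 1)) & 1
        state = ((state << 1) | fb) & mask  # shift left, insert the feedback bit
        states.append(state)
        if state == seed:                   # the cycle has closed
            break
    return states

print(lfsr_states(4))  # [1, 2, 5, 3, 7, 6, 4]
```

With these assumed taps the register visits all seven non-zero states before returning to the seed. The all-zero state is excluded because it maps to itself.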