GPTVQ: The Blessing of Dimensionality for LLM Quantization

Mart van Baalen, Andrey Kuzmin, Ivan Koryakovskiy, Markus Nagel, Peter Couperus, Cedric Bastoul, Eric Mahurin, Tijmen Blankevoort, Paul Whatmough

TL;DR

This work tackles the challenge of deploying large language models on mobile devices by reducing memory footprint and memory bandwidth through advanced quantization. It introduces GPTVQ, a fast post-training vector quantization method that extends GPTQ to non-uniform, multi-dimensional codebooks and uses Hessian-informed updates along with EM-initialized codebooks to achieve superior size-accuracy trade-offs. The approach is tailored for mobile hardware, leveraging LUT-based 6-bit indices and a hardware-aware inference stack, and demonstrates substantial footprint reductions and latency benefits across multiple LLMs and tasks, with further improvements when combined with LoRA adapters. Overall, GPTVQ offers a practical, mobile-friendly pathway to run powerful LLMs on devices with limited RAM and bandwidth, while maintaining competitive performance on standard NLP benchmarks.

Abstract

In this work we show that the size versus accuracy trade-off of neural network quantization can be significantly improved by increasing the quantization dimensionality. We propose the GPTVQ method, a new fast method for post-training vector quantization (VQ) that scales well to Large Language Models (LLMs). Our method interleaves quantization of one or more columns with updates to the remaining unquantized weights, using information from the Hessian of the per-layer output reconstruction MSE. Quantization codebooks are initialized using an efficient data-aware version of the EM algorithm. The codebooks are then updated, and further compressed by using integer quantization and SVD-based compression. GPTVQ establishes a new state of the art in the size versus accuracy trade-off on a wide range of LLMs such as Llama-v2 and Mistral. Furthermore, our method is efficient: on a single H100 it takes between 3 and 11 hours to process a Llama-v2 70B model, depending on the quantization setting. Lastly, with on-device timings for VQ decompression on a mobile CPU, we show that VQ leads to improved latency compared to using a 4-bit integer format.
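
The interleaving of quantization and weight updates can be illustrated in its simplest form. The following is a minimal sketch under stated assumptions, not the paper's algorithm: it handles a single weight row with scalar (one-dimensional) quantization in the style of GPTQ, takes the Hessian of the layer reconstruction error as a given small matrix, and uses hypothetical helper names (`invert`, `quantize_with_compensation`, `layer_error`). A vector-quantized variant would replace the per-weight `quantize` call with a codebook lookup over several columns at once.

```python
def invert(matrix):
    """Invert a small, invertible square matrix by Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def quantize_with_compensation(weights, hessian, quantize):
    """Quantize weights left to right; after each one, shift the remaining
    unquantized weights to absorb the error (assumed GPTQ-style update)."""
    w = list(weights)
    h_inv = invert(hessian)
    n = len(w)
    quantized = []
    for q in range(n):
        w_hat = quantize(w[q])
        quantized.append(w_hat)
        err = (w[q] - w_hat) / h_inv[q][q]
        for j in range(q + 1, n):
            w[j] -= err * h_inv[q][j]
        # Eliminate index q from the inverse Hessian of the remaining weights.
        pivot = h_inv[q][q]
        row_q = list(h_inv[q])
        col_q = [h_inv[i][q] for i in range(n)]
        for i in range(q + 1, n):
            for j in range(q + 1, n):
                h_inv[i][j] -= col_q[i] * row_q[j] / pivot
    return quantized


def layer_error(weights, quantized, hessian):
    """Quadratic reconstruction error e^T H e with e = weights - quantized."""
    e = [a - b for a, b in zip(weights, quantized)]
    return sum(e[i] * hessian[i][j] * e[j]
               for i in range(len(e)) for j in range(len(e)))


# Toy example (assumed values): two strongly correlated inputs, integer grid.
hessian = [[1.0, 0.9], [0.9, 1.0]]
weights = [0.4, 0.4]
to_grid = lambda x: float(round(x))

rounded = [to_grid(x) for x in weights]
compensated = quantize_with_compensation(weights, hessian, to_grid)
print(rounded, layer_error(weights, rounded, hessian))
print(compensated, layer_error(weights, compensated, hessian))
```

In this toy case plain rounding maps both weights to 0, while the compensated pass rounds the second weight up to 1 so that its error offsets the first, lowering the quadratic error from about 0.61 to about 0.09.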

Paper Structure (44 sections, 7 equations, 2 figures, 19 tables, 2 algorithms)

Figures (2)

  • Figure 1: The proposed hardware-friendly representation and GPTVQ method. Top: During quantization, the FP16 weights are split into groups with their own small codebook. Bottom: During inference, the codebooks and indices are moved from DRAM to SoC independently from each other. The codebook is implemented as a lookup table (LUT) available on modern mobile CPUs.
  • Figure 2: Top: Illustration of how vector quantization can fit 2D normally distributed data better than uniform and non-uniform grids. Bottom: SQNR increases with quantization dimensionality on Llama-v2 7B weights, due to additional flexibility in the quantization grid.
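
The ideas in the two captions (codebook lookup at decode time, and a 2D codebook fitting normal data better than a uniform grid) can be tried on synthetic data. This is a rough sketch under assumptions, not the paper's procedure: it uses plain k-means (hard EM) without any Hessian or data-aware weighting, seeds the codebook from the uniform grid itself, and draws synthetic standard-normal pairs instead of real model weights. The function names `nearest`, `lloyd`, and `sqnr_db` are hypothetical.

```python
import math
import random


def nearest(point, codebook):
    """Index of the codebook entry closest to `point` (squared Euclidean)."""
    return min(range(len(codebook)),
               key=lambda k: sum((p - c) ** 2
                                 for p, c in zip(point, codebook[k])))


def lloyd(points, codebook, iterations=8):
    """Plain k-means: alternate nearest-centroid assignment and mean update."""
    codebook = [list(c) for c in codebook]
    dim = len(codebook[0])
    for _ in range(iterations):
        sums = [[0.0] * dim for _ in codebook]
        counts = [0] * len(codebook)
        for p in points:
            k = nearest(p, codebook)
            counts[k] += 1
            for d in range(dim):
                sums[k][d] += p[d]
        for k, n in enumerate(counts):
            if n:  # centroids with no assigned points stay where they are
                codebook[k] = [s / n for s in sums[k]]
    return codebook


def sqnr_db(points, codebook):
    """Signal-to-quantization-noise ratio in dB after encode and decode."""
    indices = [nearest(p, codebook) for p in points]  # what would be stored
    decoded = [codebook[i] for i in indices]          # table-lookup decode
    signal = sum(x * x for p in points for x in p)
    noise = sum((x - y) ** 2
                for p, q in zip(points, decoded) for x, y in zip(p, q))
    return 10 * math.log10(signal / noise)


rng = random.Random(0)
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(1000)]

# Uniform baseline: 4 levels per dimension (2 bits per weight), min-max range.
lo = min(x for p in points for x in p)
hi = max(x for p in points for x in p)
levels = [lo + (hi - lo) * i / 3 for i in range(4)]
uniform_grid = [(a, b) for a in levels for b in levels]

# 2D VQ: 16 centroids, one 4-bit index per weight pair (also 2 bits per
# weight, ignoring the storage cost of the codebook itself).
vq_codebook = lloyd(points, uniform_grid)

uniform_sqnr = sqnr_db(points, uniform_grid)
vq_sqnr = sqnr_db(points, vq_codebook)
print(f"uniform grid: {uniform_sqnr:.2f} dB, 2D VQ: {vq_sqnr:.2f} dB")
```

Seeding k-means from the uniform grid is a deliberate choice for this sketch: the first assignment step reproduces uniform rounding exactly, and each k-means step can only keep or lower the squared error, so the learned codebook cannot score worse than the grid on the data it was fitted to.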