SplitQuantV2: Enhancing Low-Bit Quantization of LLMs Without GPUs

Jaewoo Song, Fangzhen Lin

TL;DR

This work tackles the challenge of deploying low-bit quantized LLMs in GPU-limited environments by introducing SplitQuantV2, a GPU-free preprocessing method that restructures each linear or convolution layer into three functionally equivalent sublayers via fixed three-way clustering of its weights. Because the split preserves the layer's output while giving each sublayer a narrower value range, and therefore a finer quantization resolution, SplitQuantV2 enables effective INT4 quantization without calibration data. On Llama 3.2 1B Instruct, the INT4 model nearly matches the accuracy of the original floating-point model after about 2 minutes of CPU-only preprocessing. The method is practical for edge devices and NPUs with limited tooling, and it delivers large speedups over GPU-dependent quantization approaches such as ZeroQuant. Overall, SplitQuantV2 lowers the barrier to accurate low-bit quantization of LLMs on CPU-only or GPU-constrained platforms.

Abstract

The quantization of large language models (LLMs) is crucial for deploying them on devices with limited computational resources. While advanced quantization algorithms offer improved performance compared to basic linear quantization, they typically require high-end graphics processing units (GPUs), are often restricted to specific deep neural network (DNN) frameworks, and require calibration datasets. These limitations make such algorithms difficult to use on the many neural processing units (NPUs) and edge AI devices that have their own model formats and frameworks. In this paper, we show that SplitQuantV2, an algorithm designed to enhance low-bit linear quantization of LLMs, can achieve results comparable to those of advanced algorithms. SplitQuantV2 preprocesses models by splitting linear and convolution layers into functionally equivalent, quantization-friendly structures. Because the algorithm is platform-agnostic, concise, and efficient, it can be implemented without GPUs. Our evaluation on the Llama 3.2 1B Instruct model using the AI2 Reasoning Challenge (ARC) dataset demonstrates that SplitQuantV2 improves the accuracy of the INT4-quantized model by 11.76 percentage points, matching the performance of the original floating-point model. Remarkably, SplitQuantV2 took only 2 minutes 6 seconds to preprocess the 1B model and perform linear INT4 quantization using only an Apple M4 CPU. SplitQuantV2 provides a practical solution for low-bit quantization of LLMs, especially when complex, computation-intensive algorithms are inaccessible due to hardware limitations or framework incompatibilities.
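To make "basic linear quantization" concrete, the following is a minimal sketch of asymmetric min-max quantization to INT4, assuming a plain list of weights. The function names and the example weights are hypothetical and are not taken from the paper; the point is only to illustrate why a wide value range leaves little resolution for the bulk of the weights.

```python
def linear_quantize(values, bits=4):
    """Asymmetric min-max linear quantization, followed by dequantization.

    Illustrative sketch only; not the paper's implementation.
    """
    lo, hi = min(values), max(values)
    levels = 2 ** bits - 1                      # 15 steps for INT4
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]   # integers in [0, levels]
    dequantized = [lo + c * scale for c in codes]
    return dequantized, scale


def max_abs_error(original, restored):
    """Largest absolute difference between two equally long sequences."""
    return max(abs(a - b) for a, b in zip(original, restored))


# Hypothetical weights: a tight group near zero plus two outliers.
weights = [-0.03, -0.02, -0.01, 0.0, 0.01, 0.02, 0.03, -1.5, 1.5]
core = weights[:7]

deq_all, scale_all = linear_quantize(weights)    # range stretched by outliers
deq_core, scale_core = linear_quantize(core)     # same small weights on their own

print("step with outliers:   ", scale_all)
print("step without outliers:", scale_core)
print("distinct levels used by the small weights (with outliers):",
      len(set(deq_all[:7])))
print("distinct levels used by the small weights (alone):",
      len(set(deq_core)))
```

When the outliers share a quantization range with the small weights, the step size grows and the small weights collapse onto very few levels. Quantizing the small weights over their own range keeps them distinguishable, which is the effect that layer splitting aims for.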

Paper Structure

This paper contains 12 sections, 1 equation, 1 figure, and 1 table.

Figures (1)

  • Figure 1: Modified and redrawn from SplitQuant [Song and Lin, 2025]. (A) Original linear or convolution layer. (B) SplitQuant improves quantization resolution by using k-means clustering on weights and biases to split the original layer into lower, middle and upper cluster layers. The functionality of the original layer is preserved.
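The split described in the Figure 1 caption can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' code: it uses a simple Lloyd's k-means on the flattened weights with k = 3, keeps the whole bias in the first sublayer as a simplification (the caption states that SplitQuant clusters biases as well), and all function names are hypothetical. SplitQuantV2's own clustering rule may differ from plain k-means.

```python
import random


def kmeans_1d(values, k=3, iters=25):
    """Plain Lloyd's k-means on scalars; returns one cluster index per value."""
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(v - centers[c]))
                  for v in values]
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:                       # leave empty clusters unchanged
                centers[c] = sum(members) / len(members)
    return labels


def linear(weight, bias, x):
    """y = W x + b for a weight given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def split_layer(weight, bias, k=3):
    """Split one linear layer into k sublayers whose outputs sum to the original."""
    cols = len(weight[0])
    labels = kmeans_1d([w for row in weight for w in row], k)
    parts = []
    for c in range(k):
        # Each sublayer keeps only the weights of its cluster; the rest are zero.
        w_c = [[w if labels[r * cols + j] == c else 0.0
                for j, w in enumerate(row)]
               for r, row in enumerate(weight)]
        # Simplifying assumption: the bias stays in the first sublayer only.
        b_c = list(bias) if c == 0 else [0.0] * len(bias)
        parts.append((w_c, b_c))
    return parts


random.seed(0)
weight = [[random.gauss(0.0, 0.05) for _ in range(8)] for _ in range(4)]
weight[0][0], weight[3][7] = -1.2, 1.4        # hypothetical outliers
bias = [random.gauss(0.0, 0.05) for _ in range(4)]
x = [random.gauss(0.0, 1.0) for _ in range(8)]

parts = split_layer(weight, bias)
y_ref = linear(weight, bias, x)
y_split = [sum(vals) for vals in zip(*(linear(w, b, x) for w, b in parts))]

print("max output difference:",
      max(abs(a - b) for a, b in zip(y_ref, y_split)))
```

Because every weight lands in exactly one sublayer, the sublayer outputs sum to the original output up to floating-point rounding. Each sublayer then covers a narrower range of nonzero values than the original layer. A real quantizer also has to represent the inserted zeros exactly, which this sketch does not address.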