pQuant: Towards Effective Low-Bit Language Models via Decoupled Linear Quantization-Aware Training

Wenzheng Zhang; Bingzheng Liu; Yang Hu; Xiaoying Bai; Wentao Zhang; Bin Cui

Abstract

Quantization-Aware Training from scratch has emerged as a promising approach for building efficient large language models (LLMs) with extremely low-bit weights (sub-2-bit), which can offer substantial advantages for edge deployment. However, existing methods still fail to achieve satisfactory accuracy and scalability. In this work, we identify a parameter democratization effect as a key bottleneck: the sensitivity of all parameters becomes homogenized, severely limiting expressivity. To address this, we propose pQuant, a method that decouples parameters by splitting linear layers into two specialized branches: a dominant 1-bit branch for efficient computation and a compact high-precision branch dedicated to preserving the most sensitive parameters. Through tailored feature scaling, we explicitly guide the model to allocate sensitive parameters to the high-precision branch. Furthermore, we extend this branch into multiple sparsely activated experts, enabling efficient capacity scaling. Extensive experiments indicate that pQuant achieves state-of-the-art performance in extremely low-bit quantization.
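To make the two-branch idea concrete, the following is a minimal forward-pass sketch in plain Python, not the authors' implementation. It assumes the 1-bit branch uses sign weights with a mean-absolute-value scale, and that the compact branch is a narrow 8-bit bottleneck of width r. The helper names (`binarize`, `quantize_int8`, `decoupled_linear`) and the `branch_scale` parameter, which stands in for the feature scaling mentioned above, are hypothetical. Training details such as gradient estimation through the quantizers are omitted.

```python
def binarize(weights):
    """Assumed 1-bit scheme: sign of each weight plus one mean-|w| scale."""
    flat = [w for row in weights for w in row]
    scale = sum(abs(w) for w in flat) / len(flat)
    signs = [[1.0 if w >= 0 else -1.0 for w in row] for row in weights]
    return signs, scale


def quantize_int8(weights):
    """Assumed symmetric absmax 8-bit scheme: integers in [-127, 127] plus a step."""
    max_abs = max(abs(w) for row in weights for w in row) or 1.0
    step = max_abs / 127.0
    q = [[max(-127, min(127, round(w / step))) for w in row] for row in weights]
    return q, step


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def decoupled_linear(x, w_main, w_down, w_up, branch_scale=1.0):
    """Sum of a dominant 1-bit branch and a compact 8-bit branch.

    w_main: (d_out x d_in) weights for the 1-bit branch.
    w_down: (r x d_in) and w_up: (d_out x r) weights for the 8-bit branch,
            with r much smaller than d_in (hypothetical bottleneck layout).
    branch_scale: hypothetical stand-in for the paper's feature scaling.
    """
    signs, scale = binarize(w_main)
    main = [scale * v for v in matvec(signs, x)]

    q_down, s_down = quantize_int8(w_down)
    q_up, s_up = quantize_int8(w_up)
    hidden = [s_down * v for v in matvec(q_down, x)]
    side = [s_up * v for v in matvec(q_up, hidden)]

    return [m + branch_scale * s for m, s in zip(main, side)]


if __name__ == "__main__":
    x = [1.0, 2.0]
    w_main = [[0.5, -1.5], [1.0, -1.0]]
    w_down = [[1.0, 0.0]]       # r = 1
    w_up = [[1.0], [0.0]]
    print(decoupled_linear(x, w_main, w_down, w_up))
```

In this sketch the 1-bit branch carries almost all of the parameters, while the 8-bit branch adds only r × (d_in + d_out) weights, which is why a small r keeps the average bit-width close to 1.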

Paper Structure (37 sections, 11 equations, 11 figures, 8 tables)

Introduction
Preliminary
Quantization Aware Training
Extremely Low-Bit Quantization
Sensitivity Analysis
Methodology
Multi-Head Attention in pQuant
Feed-Forward Network in pQuant
Efficient Scaling of pQuant
Experiments
Experimental setup
Performance of pQuant
Scalability of pQuant
Sensitivity Distribution of pQuant
Memory Efficiency of pQuant
...and 22 more sections

Figures (11)

Figure 1: An overview of performance (perplexity on WikiText2) and bit-width achieved by pQuant and other extremely low-bit methods on a 1.3B model.
Figure 2: Weight log-sensitivities in the final FFN layer of LLaMA3-3B and BitNet-3B. Matrices are downsampled via max pooling for visualization; darker blue indicates higher sensitivity. Red boxes highlight regions of peak sensitivity. Notably, in the 1-bit weights of BitNet-3B, no pronounced sensitivity variation is observed.
Figure 3: Computational flow of pQuant’s core modules. (a) pQuant replaces all linear layers with quantized counterparts, with an 8-bit branch to enable dynamic scaling when needed. (b-d) Computation in three representative pQuant modules. FP16 weights are retained solely during training to ensure numerical stability and are discarded post-training. Here, $r$ denotes the dimension of the weights in the 8-bit branch, where $r \ll D_{\text{model}}$.
Figure 4: Final training loss with varying numbers of parameters. pQuant with N=8 demonstrates excellent scalability, whereas 1-bit BitNet does not.
Figure 5: (a) Sensitivity analysis of the 1-bit and 8-bit branches in the down-projection layer of the final FFN block in the 700M pQuant model. (b) Ablation study of pQuant on the 700M-parameter model, assessing the impact of feature scaling and the number of active 8-bit branches on final loss. The mid-training loss drop stems from the two-phase learning rate schedule (see the Two-Phase Training Schedule appendix).
...and 6 more figures
