decoupleQ: Towards 2-bit Post-Training Uniform Quantization via decoupling Parameters into Integer and Floating Points

Yi Guo, Fanliu Kong, Xiaoyang Li, Hui Li, Wei Chen, Xiaogang Tian, Jinping Cai, Yang Zhang, Shouda Liu

TL;DR

decoupleQ tackles the challenge of accurate 2-bit post-training quantization for very large models by decoupling weight parameters into an integer component and a floating-point component, which reframes quantization as a constrained optimization problem. The method alternates between layer-wise optimization of the integer and floating-point parts and a block-wise refinement that freezes the integer part while tuning the floating-point components and normalization layers, with two practical approximation schemes to manage non-convexity. Empirically, decoupleQ delivers accuracy close to FP16/BF16 with 2-bit quantization of large ASR models and outperforms several PTQ baselines on public benchmarks (ImageNet, Llama) while keeping quantization uniform and hardware-friendly. The approach extends to supervised fine-tuning for downstream tasks and provides a practical path toward industrial deployment, backed by open-source code.

Abstract

Quantization has emerged in recent years as one of the most promising compression technologies for deploying large models efficiently in real-time applications. Because the storage and I/O of weights account for the vast majority of the overhead inside a large model, weight-only quantization can yield large gains. However, existing quantization schemes either suffer significant accuracy degradation at very low bit widths or require additional computational overhead at deployment, which makes them difficult to apply to large-scale industrial applications. In this paper, we propose decoupleQ, which achieves a substantial increase in model accuracy, especially at very low bit widths. decoupleQ abandons the traditional heuristic quantization paradigm and decouples the model parameters into integer and floating-point parts, thus transforming the quantization problem into a traditional constrained mathematical optimization problem, which is then solved alternately by off-the-shelf optimization methods. Quantization via decoupleQ is linear and uniform, making it more hardware-friendly than non-uniform counterparts and allowing the idea to be migrated to high-bit quantization to enhance its robustness. Our method has achieved online accuracy close to fp16/bf16 on the 2-bit quantization of large speech models at ByteDance. The code is available at https://github.com/bytedance/decoupleQ
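
The decoupling idea in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: it assumes a single output channel, a signed integer range of [-2, 1] for 2 bits, and a layer-wise objective that matches the layer output on calibration inputs. The function names `output_error` and `decouple_quantize` are hypothetical, and the integer update is a plain round-and-clip step standing in for whatever constrained solver is actually used.

```python
import numpy as np

def output_error(X, w0, w_int, s, z):
    """Squared error between the original and dequantized layer outputs."""
    return float(np.sum((X @ (s * w_int + z) - X @ w0) ** 2))

def decouple_quantize(X, w0, bits=2, iters=10):
    """Alternate between the floating-point part (s, z) and the integer part.

    X: (n, d) calibration inputs, w0: (d,) original weights of one channel.
    Returns int8 codes, scale s, zero point z, and the initial and best errors.
    """
    # Assumed signed integer bounds, e.g. [-2, 1] for 2 bits.
    alpha, beta = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

    # Min-max initialisation of the floating-point part.
    s = (w0.max() - w0.min()) / (beta - alpha) or 1.0
    z = w0.min() - s * alpha
    w_int = np.clip(np.round((w0 - z) / s), alpha, beta)

    init_err = output_error(X, w0, w_int, s, z)
    best = (init_err, w_int.copy(), float(s), float(z))
    target = X @ w0

    for _ in range(iters):
        # Integer part fixed: X @ (s * w_int + z) is linear in (s, z),
        # so the floating-point part has a closed-form least-squares solution.
        A = np.stack([X @ w_int, X.sum(axis=1)], axis=1)
        (s, z), *_ = np.linalg.lstsq(A, target, rcond=None)

        err = output_error(X, w0, w_int, s, z)
        if err < best[0]:
            best = (err, w_int.copy(), float(s), float(z))
        if abs(s) < 1e-12:
            break

        # Floating-point part fixed: re-assign integers by round-and-clip.
        # This is a simplification (assumption), not the paper's solver.
        w_int = np.clip(np.round((w0 - z) / s), alpha, beta)

    err, w_int, s, z = best
    return w_int.astype(np.int8), s, z, init_err, err

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 64))
w0 = rng.standard_normal(64)
codes, s, z, err_minmax, err_best = decouple_quantize(X, w0)
print(codes[:8], s, z, err_minmax, err_best)
```

Because the sketch keeps the best iterate, the returned error is never worse than the min-max starting point. The weight is still reconstructed as `s * codes + z`, so the result stays a uniform quantizer.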

Paper Structure (16 sections, 12 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Top-1 accuracy of ResNet-18 on ImageNet w.r.t. the number of iterations $K$. Solid lines use approximation \ref{level1}; dashed lines use approximation \ref{level2}. Blue represents quantization via decoupleQ with only the layer-wise minimization; red adds one epoch of SFT to the blue setting.
  • Figure 2: PPL of Llama-7B on WikiText2 w.r.t. the number of iterations $K$. The solid line uses approximation \ref{level1}; the dashed line uses approximation \ref{level2}. The horizontal axis represents $K$, and the vertical axis represents PPL. The model is quantized into W2A16, and block-wise minimization is not used in this experiment. When $K>1$, solving approximation \ref{level1} yields better model accuracy than approximation \ref{level2}.
  • Figure 3: PPL of Llama-7B on WikiText2, and the loss of the first block between pre- and post-quantization, w.r.t. the number of iterations $K$ when using approximation \ref{level1}. The dashed line is for approximation \ref{level2}. The model is quantized into W2A16, and both the layer-wise and block-wise minimization are used. The best PPL is attained at $K=1$; PPL then fluctuates within a range as $K$ increases, but all of these PPLs are worse than the PPL obtained with approximation \ref{level2}. The loss of the first block between pre- and post-quantization, defined in \ref{block-min}, is plotted on the right vertical axis. As $K$ increases, the loss decreases strictly monotonically, and when $K > 2$ it falls below the loss obtained with approximation \ref{level2}.
  • Figure 4: Perplexity of Llama-7B on the WikiText2 and C4 datasets w.r.t. the number of segments used as calibration data. The model is quantized into W2A16g64.
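
The captions use the common shorthand W2A16 (2-bit weights, 16-bit activations) and W2A16g64 (the same, with one scale and zero point per group of 64 weights). As a rough illustration of the storage cost of grouping, the sketch below computes the effective bits per weight. It assumes the scale and zero point are each stored in 16 bits, which this text does not state; `effective_bits_per_weight` is a hypothetical helper.

```python
def effective_bits_per_weight(weight_bits, group_size, scale_bits=16, zero_bits=16):
    """Average storage per weight when each group carries its own scale and zero point.

    scale_bits and zero_bits default to 16 as an assumption (fp16/bf16 storage).
    """
    return weight_bits + (scale_bits + zero_bits) / group_size

print(effective_bits_per_weight(2, 64))  # 2 + 32/64 = 2.5
```

Under these assumptions, smaller groups give the floating-point part more freedom at the cost of more overhead per weight.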