FBQuant: FeedBack Quantization for Large Language Models

Yijiang Liu, Hengyu Fang, Liulu He, Rongyu Zhang, Yichuan Bai, Yuan Du, Li Du

TL;DR

FBQuant targets on-device deployment of large language models, where loading weights is memory-bandwidth bound. It applies weight-only quantization and offsets the resulting accuracy loss with a feedback-based sub-branch mechanism. The reconstructed weight is $W_F = \mathcal{Q}(W-\Sigma) + \Sigma$, so its deviation from $W$ equals the quantization error of $W-\Sigma$ and stays bounded by the quantizer while the sub-branch adapters are optimized differentiably. A fused CUDA kernel removes about 60% of the extra inference time that the sub-branch introduces. Empirical results show FBQuant achieving state-of-the-art perplexity and zero-shot accuracy across multiple models (e.g., 3-bit Llama2-7B gains 1.2% in zero-shot accuracy) and improved wall-clock throughput on real devices. FBQuant thus offers a scalable, calibration-data-efficient path toward robust, low-latency edge inference for modern LLMs.

Abstract

Deploying Large Language Models (LLMs) on edge devices is increasingly important, as it eliminates reliance on network connections, reduces expensive API calls, and enhances user privacy. However, on-device deployment is challenging due to the limited computational resources of edge devices. In particular, the key bottleneck stems from memory bandwidth constraints related to weight loading. Weight-only quantization effectively reduces memory access, yet often induces significant accuracy degradation. Recent efforts to incorporate sub-branches have shown promise for mitigating quantization errors, but these methods either lack robust optimization strategies or rely on suboptimal objectives. To address these gaps, we propose FeedBack Quantization (FBQuant), a novel approach inspired by negative feedback mechanisms in automatic control. FBQuant inherently ensures that the reconstructed weights remain bounded by the quantization process, thereby reducing the risk of overfitting. To offset the additional latency introduced by sub-branches, we develop an efficient CUDA kernel that reduces the extra inference time by 60%. Comprehensive experiments demonstrate the efficiency and effectiveness of FBQuant across various LLMs. Notably, for 3-bit Llama2-7B, FBQuant improves zero-shot accuracy by 1.2%.
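
The boundedness claim can be checked numerically. Since $W_F - W = \mathcal{Q}(W-\Sigma) - (W-\Sigma)$, the reconstruction error is exactly the quantization error of the residual, whatever $\Sigma$ is. The sketch below is not the paper's implementation: the `quantize` helper, the per-tensor round-to-nearest grid, and the two stand-in choices of $\Sigma$ (a rank-8 truncated SVD and a random matrix) are assumptions made for illustration only.

```python
import numpy as np

def quantize(w, n_bits=3):
    """Hypothetical round-to-nearest uniform quantizer (quantize, then dequantize).

    The grid spans [w.min(), w.max()], so no value is clipped.
    Returns the dequantized tensor and the grid step size.
    """
    levels = 2 ** n_bits - 1
    lo, hi = float(w.min()), float(w.max())
    step = (hi - lo) / levels
    return np.round((w - lo) / step) * step + lo, step

def feedback_reconstruct(w, sigma, n_bits=3):
    """Compute W_F = Q(W - Sigma) + Sigma and the step of the grid used by Q."""
    q_res, step = quantize(w - sigma, n_bits)
    return q_res + sigma, step

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))

# Stand-in sub-branch: rank-8 truncated SVD of W (an assumption, not trained adapters).
U, S, Vt = np.linalg.svd(W, full_matrices=False)
sigma_svd = (U[:, :8] * S[:8]) @ Vt[:8]

# A deliberately poor sub-branch, to show the bound does not depend on how Sigma was fit.
sigma_bad = rng.normal(size=W.shape)

for name, sigma in [("svd", sigma_svd), ("random", sigma_bad)]:
    W_F, step = feedback_reconstruct(W, sigma)
    max_err = np.abs(W_F - W).max()
    print(f"{name}: max |W_F - W| = {max_err:.4f}, step / 2 = {step / 2:.4f}")
    # Round-to-nearest on an unclipped grid errs by at most half a step.
    assert max_err <= step / 2 + 1e-9
```

Under these assumptions the elementwise error never exceeds half a quantization step, even for the random $\Sigma$. A quantizer that clips outliers would loosen the bound for the clipped entries.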

Paper Structure

This paper contains 25 sections, 32 equations, 7 figures, 8 tables, 1 algorithm.

Figures (7)

  • Figure 1: Impact of weight-only quantization on the RTX 3090 GPU. (Left) For Llama2-7B, the INT4 model processes 1,024 tokens for prefilling and 80 new tokens for decoding in only 60% of the time required by FP16. (Right) After loading to the GPU device, the INT4 model consumes just 25% of the memory used by FP16.
  • Figure 2: Three categories of optimization methods for weight-only quantization: Clamping, Rotation, and Sub-branching.
  • Figure 3: (Left) The main path incorporates feedback signals from the sub-branch to facilitate improved weight quantization, where $\hat{\mathbf{W}}$ represents the quantized weights in the main path, obtained via a quantizer $\mathcal{Q}(\cdot)$, and $\mathbf{\Sigma}$ denotes the weights in the sub-branch. (Right) Direct quantization of the original weights (red) maps them to the nearest quantization bins (blue). In contrast, the FBQuant method (green) applies a multi-step quantization approach, progressively adjusting the weights towards their original values in three stages.
  • Figure 4: MACs and latency of the linear layer in Llama2-7B. (Top-left) The MACs introduced by the main path $\mathbf{WX}^\top$ and the sub-branch $\mathbf{BAX}^\top$ are $M_0=b\times d\times d$ and $M_1=2\times b\times r\times d$, respectively, where $b$ is the batch size, $r$ is the rank value, and $d$ is the layer dimension. This results in $M_1/M_0=2r/d=6.25\%$ additional MACs when $r=128$ and $d=4096$. However, naively implementing this sub-branch increases the latency by 20% when prefilling (right), and up to four times when decoding (bottom-left). FBQuant significantly mitigates this problem through kernel fusion.
  • Figure 5: Kernel Fusion. We integrate the de-quantization and the linear projection in the main path, and the up-projection in the sub-branch, into the same kernel. Launching fewer kernels reduces kernel launch time, and the integration also reduces repeated writes to the output activations.
  • ...and 2 more figures
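
The MAC overhead quoted in the Figure 4 caption follows from simple counting, and the short sketch below reproduces it. The function names are hypothetical, and the batch size `b = 1` is an arbitrary choice since it cancels in the ratio.

```python
def main_path_macs(b, d):
    """MACs of the main path W X^T for a d x d layer: M_0 = b * d * d."""
    return b * d * d

def sub_branch_macs(b, r, d):
    """MACs of the rank-r sub-branch B A X^T: M_1 = 2 * b * r * d."""
    return 2 * b * r * d

b, r, d = 1, 128, 4096  # r and d as in the Figure 4 caption
m0 = main_path_macs(b, d)
m1 = sub_branch_macs(b, r, d)
ratio = m1 / m0  # simplifies to 2 * r / d
print(f"extra MACs from the sub-branch: {ratio:.2%}")  # prints 6.25%
```

The gap between this small compute overhead and the much larger latency overhead reported in the caption is what motivates fusing the sub-branch into the main kernel: the cost comes mainly from extra kernel launches and memory traffic rather than from arithmetic.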