FlexQuant: A Flexible and Efficient Dynamic Precision Switching Framework for LLM Quantization
Fangxin Liu, Zongwu Wang, JinHong Xia, Junping Zhao, Shouren Zhao, Jinjin Li, Jian Liu, Li Jiang, Haibing Guan
TL;DR
This paper tackles the memory bottleneck in autoregressive LLM inference caused by rapidly growing parameter counts. It introduces FlexQuant, a dynamic precision-switching framework that combines token-wise mixed precision with layer-wise switching guided by perplexity entropy ($PPLE$) and KL divergence, enabling adaptive, end-to-end optimization of speed and accuracy. Key contributions include a precision requirement analysis based on $PPLE$, a fine-grained mixed-precision decoding scheme, and a token-wise switching manager that switches weights from INT8 to INT4 based on real-time throughput-accuracy signals. The framework is validated by up to a 1.3× speedup on long-context tasks with negligible accuracy loss. By balancing memory bandwidth against computation through data-driven precision control, the approach offers a practical path to efficient LLM deployment under memory constraints, and the code is publicly released for replication.
Abstract
The rapid advancement of large language models (LLMs) has exacerbated the memory bottleneck due to the widening gap between model parameter scaling and hardware capabilities. While post-training quantization techniques effectively reduce memory overhead, existing methods predominantly rely on static quantization strategies, which struggle to adapt to dynamic workloads. To address this, we propose FlexQuant, a dynamic precision-switching framework that optimizes the trade-off between inference speed and accuracy. Leveraging model perplexity entropy and Kullback-Leibler divergence, FlexQuant enables fine-grained, layer-wise mixed-precision quantization and dynamically adjusts bit-widths at each token-generation step. FlexQuant provides a comprehensive analysis of quantization strategies, introduces a precision requirement model for optimal switching, and implements efficient fine-grained precision management. Evaluations demonstrate that FlexQuant achieves a 1.3× end-to-end speedup across diverse language tasks with negligible accuracy loss. This framework offers a flexible and adaptive solution for efficient LLM deployment. Code is released at https://github.com/ZongwuWang/FlexQuant.git.
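
To make the idea of divergence-guided precision selection concrete, the following is a minimal illustrative sketch and not FlexQuant's actual algorithm. It assumes that next-token logits are available from a full-precision reference and from each candidate bit-width, and it picks the lowest bit-width whose output distribution stays within a KL-divergence budget of the reference. The function names, the `kl_budget` parameter, and the example logits are all hypothetical.

```python
import math


def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def pick_bit_width(ref_logits, logits_by_bits, kl_budget):
    """Return the smallest bit-width whose distribution is within kl_budget
    of the reference; fall back to the largest bit-width otherwise.

    Hypothetical helper for illustration only: the paper's switching rule
    also involves perplexity entropy, which is not modeled here.
    """
    p = softmax(ref_logits)
    for bits in sorted(logits_by_bits):
        q = softmax(logits_by_bits[bits])
        if kl_divergence(p, q) <= kl_budget:
            return bits
    return max(logits_by_bits)


# Made-up logits: INT8 barely perturbs the reference, INT4 swaps the top two tokens.
ref = [2.0, 1.0, 0.1]
candidates = {8: [1.98, 1.01, 0.12], 4: [1.0, 2.0, 0.1]}

strict = pick_bit_width(ref, candidates, kl_budget=0.01)
loose = pick_bit_width(ref, candidates, kl_budget=1.0)
print(strict, loose)
```

With a tight budget the sketch keeps the higher precision because the INT4 distribution diverges too far from the reference; with a loose budget it accepts the cheaper INT4 setting. A real system would have to estimate this signal without running the full-precision model at every step.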
