QuEPT: Quantized Elastic Precision Transformers with One-Shot Calibration for Multi-Bit Switching
Ke Xu, Yixin Wang, Zhongcheng Li, Hao Cui, Jinshui Hu, Xingyi Zhang
TL;DR
QuEPT addresses the challenge of deploying transformer models across diverse quantization budgets by enabling elastic, post-training multi-bit quantization with a single calibration pass. It introduces two key innovations: Multi-Bit Token Merging (MB-ToMe), which fuses token features across bit-widths while preserving high-bit information, and Multi-Bit Cascaded LoRA (MB-CLoRA), which hierarchically shares low-rank adapters across bit-width groups, enabling real-time switching among uniform and mixed-precision configurations without re-optimization. The method is formulated as block-wise reconstruction with joint optimization over low-rank adapters and clipping parameters across a bit-width set $\mathcal{B}$, and uses a three-tier bit-width sampling pipeline to train a single model usable at any predefined width by selecting the corresponding LoRA slice. Empirical results on ViTs, LLaMA, and LLaVA-OV demonstrate performance competitive with or superior to state-of-the-art PTQ methods, with substantial reductions in training overhead and the ability to deploy mixed-precision configurations in a training-free manner. This approach broadens practical deployment of large transformers by balancing accuracy, robustness, and efficiency across a spectrum of quantization budgets, while leaving extreme low-bit scenarios and explicit outlier mitigation as future work.
Abstract
Elastic precision quantization enables multi-bit deployment via a single optimization pass, fitting diverse quantization scenarios. Yet, owing to the high storage and optimization costs associated with the Transformer architecture, research on elastic quantization remains limited, particularly for large language models. This paper proposes QuEPT, an efficient post-training scheme that reconstructs block-wise multi-bit errors with one-shot calibration on a small data slice. It can dynamically adapt to various predefined bit-widths by cascading different low-rank adapters, and supports real-time switching between uniform quantization and mixed-precision quantization without repeated optimization. To enhance accuracy and robustness, we introduce Multi-Bit Token Merging (MB-ToMe) to dynamically fuse token features across different bit-widths, improving robustness during bit-width switching. Additionally, we propose Multi-Bit Cascaded Low-Rank adapters (MB-CLoRA) to strengthen correlations between bit-width groups, further improving the overall performance of QuEPT. Extensive experiments demonstrate that QuEPT achieves performance comparable to or better than existing state-of-the-art post-training quantization methods. Our code is available at https://github.com/xuke225/QuEPT
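
To make the idea of switching bit-widths by cascading low-rank adapters concrete, the following is a minimal NumPy sketch, not the released QuEPT implementation. It rests on several assumptions that do not come from the paper: a symmetric per-tensor uniform quantizer, a hypothetical `rank_prefix` table mapping each bit-width to the number of leading adapter ranks it uses, and the adapter correction being added after weight quantization. The names `fake_quantize` and `CascadedLoRALinear` are illustrative only.

```python
import numpy as np


def fake_quantize(w, bits):
    """Symmetric per-tensor uniform quantization followed by dequantization.

    Assumption: the paper may use a different quantizer with learned clipping.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale


class CascadedLoRALinear:
    """Linear layer whose low-rank correction is sliced per bit-width (illustrative)."""

    def __init__(self, weight, rank_prefix, seed=0):
        # rank_prefix is a hypothetical map: bit-width -> number of leading ranks.
        # Smaller prefixes are nested inside larger ones, so parameters are shared.
        rng = np.random.default_rng(seed)
        out_features, in_features = weight.shape
        r_max = max(rank_prefix.values())
        self.weight = weight
        self.rank_prefix = dict(rank_prefix)
        self.A = rng.normal(scale=0.01, size=(r_max, in_features))
        self.B = np.zeros((out_features, r_max))  # zero init: adapter starts as a no-op

    def effective_weight(self, bits):
        # Switching precision only changes the quantizer setting and the slice
        # of A and B that is read; nothing is re-optimized here.
        r = self.rank_prefix[bits]
        return fake_quantize(self.weight, bits) + self.B[:, :r] @ self.A[:r, :]

    def forward(self, x, bits):
        return x @ self.effective_weight(bits).T


# Usage: one set of parameters, evaluated at different bit-widths per call.
rng = np.random.default_rng(0)
layer = CascadedLoRALinear(rng.normal(size=(6, 5)), {8: 2, 6: 4, 4: 8})
x = rng.normal(size=(3, 5))
y_int8 = layer.forward(x, bits=8)
y_int4 = layer.forward(x, bits=4)
```

In this sketch a mixed-precision configuration amounts to passing a different `bits` value to each layer, which is why no further calibration is needed once the shared adapters have been fitted.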
