
QuEPT: Quantized Elastic Precision Transformers with One-Shot Calibration for Multi-Bit Switching

Ke Xu, Yixin Wang, Zhongcheng Li, Hao Cui, Jinshui Hu, Xingyi Zhang

TL;DR

QuEPT addresses the challenge of deploying transformer models across diverse quantization budgets by enabling elastic, post-training multi-bit quantization with a single calibration pass. It introduces two key innovations: Multi-Bit Token Merging (MB-ToMe) to fuse token features across bit-widths while preserving high-bit information, and Multi-Bit Cascaded LoRA (MB-CLoRA) to hierarchically share low-rank adapters across bit-width groups, enabling real-time switching among uniform and mixed precision without re-optimization. The method is formulated as block-wise reconstruction with joint optimization over low-rank adapters and clipping parameters across a bit-width set $\mathcal{B}$, and uses a three-tier bit-width sampling pipeline to train a single model usable at any predefined width by selecting the corresponding LoRA slice. Empirical results on ViTs, LLaMA, and LLaVA-OV demonstrate competitive or superior performance to state-of-the-art PTQ methods, with substantial reductions in training overhead and the ability to deploy mixed-precision configurations in a training-free manner. This approach broadens practical deployment of large transformers by balancing accuracy, robustness, and efficiency across a spectrum of quantization budgets, while highlighting future work on extreme low-bit scenarios and explicit outlier mitigation.

Abstract

Elastic precision quantization enables multi-bit deployment via a single optimization pass, fitting diverse quantization scenarios. Yet, because of the high storage and optimization costs associated with the Transformer architecture, research on elastic quantization remains limited, particularly for large language models. This paper proposes QuEPT, an efficient post-training scheme that reconstructs block-wise multi-bit errors with one-shot calibration on a small data slice. It can dynamically adapt to various predefined bit-widths by cascading different low-rank adapters, and supports real-time switching between uniform quantization and mixed-precision quantization without repeated optimization. To enhance accuracy and robustness, we introduce Multi-Bit Token Merging (MB-ToMe) to dynamically fuse token features across different bit-widths, improving robustness during bit-width switching. Additionally, we propose Multi-Bit Cascaded Low-Rank Adapters (MB-CLoRA) to strengthen correlations between bit-width groups, further improving the overall performance of QuEPT. Extensive experiments demonstrate that QuEPT achieves performance comparable to or better than existing state-of-the-art post-training quantization methods. Our code is available at https://github.com/xuke225/QuEPT
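
To make the idea of switching bit-widths by selecting a slice of a shared low-rank adapter more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the symmetric per-row quantizer, the class name `CascadedLoRA`, the `rank_per_bit` mapping, and the rule that each bit-width uses a prefix of the rank dimension are all assumptions made for illustration. The frozen weight is quantized at the requested width and a low-rank compensation term is added on top.

```python
import numpy as np


def fake_quantize(w, bits):
    """Quantize-dequantize with a symmetric per-row scale (an assumed stand-in quantizer)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero on all-zero rows
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale


class CascadedLoRA:
    """Hypothetical cascaded adapter: each bit-width reads a prefix of one shared rank dimension."""

    def __init__(self, out_dim, in_dim, rank_per_bit, seed=0):
        # rank_per_bit maps bit-width -> number of leading rank components used,
        # e.g. {8: 2, 4: 4}: the 4-bit slice contains the 8-bit slice (assumption).
        rng = np.random.default_rng(seed)
        self.rank_per_bit = dict(rank_per_bit)
        max_rank = max(self.rank_per_bit.values())
        self.A = rng.normal(scale=0.01, size=(max_rank, in_dim))
        self.B = np.zeros((out_dim, max_rank))  # zero init: no compensation before calibration

    def compensation(self, bits):
        r = self.rank_per_bit[bits]
        return self.B[:, :r] @ self.A[:r, :]


def effective_weight(w, adapter, bits):
    """Frozen weight quantized at `bits`, plus the low-rank compensation slice for that width."""
    return fake_quantize(w, bits) + adapter.compensation(bits)
```

Under these assumptions, switching precision at inference time amounts to calling `effective_weight` with a different `bits` value, and a mixed-precision configuration would simply pick a different value per layer. No adapter is re-optimized when the width changes.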

Paper Structure

This paper contains 24 sections, 9 equations, 4 figures, and 6 tables.

Figures (4)

  • Figure 1: Overview of our proposed QuEPT. In block-wise reconstruction, QuEPT calibrates the low-rank compensation matrix $\bm R$ and the weight clipping parameters $\bm \alpha$ and $\bm \beta$, while the weights $\bm W$ and quantization scale $\bm S$ are frozen. The reconstruction process consists of two stages: (1) merging multiple bit-width features from different bit groups via Multi-Bit Token Merging (MB-ToMe); (2) optimizing the multi-bit quantization loss using Multi-Bit Cascaded Low-Rank Adapters (MB-CLoRA).
  • Figure 2: Three cases of multi-bit token merging strategy.
  • Figure 3: Feature distribution comparison for the top 4 divergent tokens at the input of block 1 of LLaMA2-7B. Tokens are sorted by the Kolmogorov-Smirnov (K-S) statistic.
  • Figure 4: Mixed-precision results for LLaMA2-7B.
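
The token ranking mentioned in the Figure 3 caption can be illustrated with a short sketch of the two-sample Kolmogorov-Smirnov statistic, which is the largest gap between two empirical CDFs. The function names, the `(num_tokens, hidden_dim)` layout, and the idea that the two inputs come from the same block under two different settings (for example two bit-widths) are assumptions for illustration; the caption does not specify the exact procedure.

```python
import numpy as np


def ks_statistic(x, y):
    """Two-sample K-S statistic: maximum absolute gap between the empirical CDFs of x and y."""
    x = np.sort(np.asarray(x, dtype=float).ravel())
    y = np.sort(np.asarray(y, dtype=float).ravel())
    grid = np.concatenate([x, y])  # the gap can only change at observed sample points
    cdf_x = np.searchsorted(x, grid, side="right") / x.size
    cdf_y = np.searchsorted(y, grid, side="right") / y.size
    return float(np.abs(cdf_x - cdf_y).max())


def top_divergent_tokens(feat_a, feat_b, k=4):
    """Rank tokens by per-token K-S statistic between two (num_tokens, hidden_dim) arrays."""
    stats = np.array([ks_statistic(a, b) for a, b in zip(feat_a, feat_b)])
    order = np.argsort(-stats)[:k]  # indices of the k most divergent tokens
    return order, stats[order]
```

A statistic of 0 means the two per-token feature distributions coincide, and a value near 1 means they barely overlap, so the tokens returned first are the ones whose features shift most between the two settings.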