IPTQ-ViT: Post-Training Quantization of Non-linear Functions for Integer-only Vision Transformers

Gihwan Kim, Jemin Lee, Hyungshin Kim

TL;DR

This work addresses the challenge of deploying vision transformers with fully integer-only inference without retraining. It introduces IPTQ-ViT, a post-training quantization framework that combines Data-aware Poly-GELU and Efficient Bit-exp for Softmax with a Unified Metric to assign per-layer non-linear approximations, forming a mixed-quantized model that remains fully integer-based after calibration. The approach improves accuracy over prior PTQ methods and attains latency comparable to integer-only QAT methods, with demonstrated gains on ImageNet and COCO. The proposed methods enable practical deployment of ViTs on resource-constrained devices, and the authors plan to release code for reproducibility.

Abstract

Previous Quantization-Aware Training (QAT) methods for vision transformers rely on expensive retraining to recover accuracy loss in non-linear layer quantization, limiting their use in resource-constrained environments. In contrast, existing Post-Training Quantization (PTQ) methods either partially quantize non-linear functions or adjust activation distributions to maintain accuracy but fail to achieve fully integer-only inference. In this paper, we introduce IPTQ-ViT, a novel PTQ framework for fully integer-only vision transformers without retraining. We present two approximation functions: a polynomial-based GELU optimized for vision data and a bit-shifting-based Softmax designed to improve approximation accuracy in PTQ. In addition, we propose a unified metric integrating quantization sensitivity, perturbation, and computational cost to select the optimal approximation function per activation layer. IPTQ-ViT outperforms previous PTQ methods, achieving up to 6.44%p (avg. 1.78%p) top-1 accuracy improvement for image classification and 1.0 mAP for object detection. IPTQ-ViT outperforms partial floating-point PTQ methods under W8A8 and W4A8, and achieves accuracy and latency comparable to integer-only QAT methods. We plan to release our code at https://github.com/gihwan-kim/IPTQ-ViT.git.

Paper Structure

This paper contains 18 sections, 9 equations, 4 figures, 9 tables.

Figures (4)

  • Figure 1: Activation distributions of GELU in the 11th block of ViT-B, which shows the highest quantization sensitivity in Tab. \ref{tab:motivation_qat2ptq_sensitivity} (I-BERT${}^\ast$). Visualized for (a) full-precision, (b) PTQ-quantized I-BERT, (c) our method, and (d) QAT-quantized I-BERT, with token sub-sampling applied. Both (b) and (d) use i-GELU, the QAT-based approximation from I-BERT. (b) shows a massive imbalance, highlighting the limitation of applying QAT-designed methods to PTQ settings in vision tasks. More results are presented in Appendix Fig. 4.
  • Figure 2: Overview of the IPTQ-ViT pipeline. In Stage 1, each non-linear layer is quantized with every candidate approximation function, and the Unified Metric is computed for each case. Stage 2 assigns to each activation layer the approximation function with the maximum metric value. Stage 3 calibrates the resulting mixed-quantized model.
  • Figure 3: Left: Comparison of erf approximations by the baseline, I-BERT, and ours. Right: Comparison of GELU approximations by the baseline, i-GELU (I-BERT), and ours.
  • Figure 4: Quantization runtime on DeiT-S under W8A8, measured on a single NVIDIA RTX 3090 GPU. Times exclude inference.
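
The selection step described in the Figure 2 caption (Stage 2) can be sketched as a per-layer argmax over metric values. This is only an illustration under stated assumptions, not the authors' implementation: the Unified Metric itself is not defined in this section, so the scores below are made-up placeholders, and the layer names, the candidate labels, and the function `assign_approximations` are hypothetical.

```python
from typing import Dict


def assign_approximations(scores: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """For each layer, pick the candidate with the highest metric value.

    `scores` maps layer name -> {candidate name -> metric value}, i.e. the
    kind of table Stage 1 would produce. All names here are hypothetical.
    """
    return {
        layer: max(candidates, key=candidates.get)
        for layer, candidates in scores.items()
    }


# Placeholder Stage 1 output (invented numbers, for illustration only).
scores = {
    "blocks.0.mlp.act": {"data_aware_poly_gelu": 0.82, "i_gelu": 0.61},
    "blocks.0.attn.softmax": {"efficient_bit_exp": 0.74, "other_candidate": 0.79},
}

assignment = assign_approximations(scores)
for layer, choice in assignment.items():
    print(f"{layer}: {choice}")
```

Because each layer is decided independently, different layers may end up with different approximation functions, which is what makes the resulting model "mixed-quantized" before the Stage 3 calibration.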