MPTQ-ViT: Mixed-Precision Post-Training Quantization for Vision Transformer

Yu-Shan Tai, An-Yeu Wu

TL;DR

Vision transformers have high compute and memory demands, and post-training quantization (PTQ) of them is difficult in part because their activation distributions are asymmetric. The authors propose MPTQ-ViT, a mixed-precision PTQ framework that combines three components: SQ-b, which reduces activation asymmetry; OPT-m, which computes data-driven, region-wise scaling factors for post-GeLU values; and Greedy MP, which allocates layer-wise bit-widths by balancing performance and compressibility. Experiments on ViT, DeiT, and Swin show accuracy gains over prior PTQ baselines on ImageNet under both single-precision and mixed-precision quantization, especially at low bit-widths, with competitive results on COCO. Overall, the work shows that fine-grained, data-driven quantization parameters combined with greedy layer-wise bit-width allocation can improve both the compressibility and the accuracy of quantized ViTs.

Abstract

While vision transformers (ViTs) have shown great potential in computer vision tasks, their intense computation and memory requirements pose challenges for practical applications. Existing post-training quantization methods leverage value redistribution or specialized quantizers to address the non-normal distribution in ViTs. However, without considering the asymmetry in activations and relying on hand-crafted settings, these methods often struggle to maintain performance under low-bit quantization. To overcome these challenges, we introduce SmoothQuant with bias term (SQ-b) to alleviate the asymmetry issue and reduce the clamping loss. We also introduce optimal scaling factor ratio search (OPT-m) to determine quantization parameters by a data-dependent mechanism automatically. To further enhance the compressibility, we incorporate the above-mentioned techniques and propose a mixed-precision post-training quantization framework for vision transformers (MPTQ-ViT). We develop greedy mixed-precision quantization (Greedy MP) to allocate layer-wise bit-width considering both model performance and compressibility. Our experiments on ViT, DeiT, and Swin demonstrate significant accuracy improvements compared with SOTA on the ImageNet dataset. Specifically, our proposed methods achieve accuracy improvements ranging from 0.90% to 23.35% on 4-bit ViTs with single-precision and from 3.82% to 78.14% on 5-bit fully quantized ViTs with mixed-precision.
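To illustrate why activation asymmetry matters for low-bit quantization, the following is a minimal sketch in plain Python. It is not the authors' SQ-b: the activation range, the midpoint bias, and the helper names (`quantize_symmetric`, `mse`) are assumptions chosen for illustration, and this summary does not specify how SQ-b selects or absorbs its bias term. The sketch only shows that shifting a one-sided distribution toward zero before symmetric uniform quantization lets the grid cover the data more tightly.

```python
import random


def quantize_symmetric(values, bits):
    """Uniform symmetric quantization: the integer grid is centered on zero."""
    qmax = 2 ** (bits - 1) - 1
    # Scale chosen so the largest magnitude maps to the last grid level.
    scale = max(abs(v) for v in values) / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


random.seed(0)
# Hypothetical asymmetric activations: mostly positive, range about [-0.2, 6.0].
acts = [random.uniform(-0.2, 6.0) for _ in range(10_000)]

# Baseline: quantize directly. Half of the symmetric grid covers values
# below -0.2 that never occur, so the step size is larger than necessary.
err_plain = mse(acts, quantize_symmetric(acts, bits=4))

# Assumed bias: the midpoint of the observed range. Subtract it, quantize
# the now roughly symmetric values, then add the bias back.
bias = (max(acts) + min(acts)) / 2
shifted = [v - bias for v in acts]
restored = [q + bias for q in quantize_symmetric(shifted, bits=4)]
err_shifted = mse(acts, restored)

print(f"4-bit MSE without bias shift: {err_plain:.4f}")
print(f"4-bit MSE with bias shift:    {err_shifted:.4f}")
```

In this toy setting the shifted version uses roughly half the step size of the baseline, so its quantization error is lower. A real pipeline would also need to compensate for the bias elsewhere in the network, which this sketch does not attempt.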

Paper Structure

This paper contains 20 sections, 11 equations, 6 figures, and 8 tables.

Figures (6)

  • Figure 1: Proposed mixed-precision post-training quantization framework for ViT (MPTQ-ViT). (a) SQ-b, (b) OPT-m, and (c) Greedy MP.
  • Figure 1: Box plots of block-wise post-GeLU values on (a) ViT-B and (b) DeiT-S.
  • Figure 2: OPT-m under 6-bit quantization. Neg-GeLU/Pos-GeLU are the histograms of negative/positive post-GeLU values.
  • Figure 3: L2 distance between $\mu$ and $\mu_r$ of ViT-L.
  • Figure 4: Distribution of negative (Neg) and positive (Pos) post-GeLU values of the $9^{th}$ block of DeiT-S under 4-bit quantization: (a)(b) original, (c)(d) TSPTQ-ViT, (e)(f) proposed OPT-m.
  • ...and 1 more figure