Q-HyViT: Post-Training Quantization of Hybrid Vision Transformers with Bridge Block Reconstruction for IoT Systems

Jemin Lee, Yongin Kwon, Sihyeong Park, Misun Yu, Jeman Park, Hwanjun Song

TL;DR

Q-HyViT addresses post-training quantization (PTQ) of efficient hybrid vision transformers, where existing methods lose accuracy because of four challenges: highly dynamic activation ranges, zero-point overflow in bridge blocks, diverse normalization schemes, and small model size (fewer than 5M parameters). It introduces a Hessian-based hybrid reconstruction error minimization framework that automatically selects, for each layer, the quantization granularity (channel-wise vs. layer-wise) and the quantization scheme (symmetric vs. asymmetric), with dedicated treatment of bridge blocks. The method reports average improvements over existing PTQ baselines of 17.73% at 8-bit and 29.75% at 6-bit, and a 43.63% average improvement over FQ-ViT in the fully quantized setting, across multiple hybrid ViT architectures. This work offers a practical path to deploying accurate, fully quantized hybrid ViTs on IoT devices, supported by a calibration-based workflow and detailed experimental validation. The approach also provides insight into designing quantization strategies for mixed CNN–Transformer structures and highlights the importance of bridge-block-aware reconstruction for edge-friendly AI.
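
The per-layer selection described above can be illustrated with a minimal sketch. This is not the authors' implementation: it replaces the Hessian-based reconstruction error with plain mean squared error, uses min/max calibration, and clamps the zero point to the integer range. All function names (`quantize_dequantize`, `reconstruction_error`, `select_config`) are hypothetical.

```python
from itertools import product

def quantize_dequantize(values, n_bits, symmetric):
    """Fake-quantize a flat list of floats with one shared scale."""
    lo, hi = min(values), max(values)
    if symmetric:
        q_min, q_max = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
        scale = max(abs(lo), abs(hi)) / q_max
        zero_point = 0
    else:
        q_min, q_max = 0, 2 ** n_bits - 1
        scale = (hi - lo) / q_max
    if scale == 0:
        scale = 1.0  # degenerate (constant) input
    if not symmetric:
        # Assumption: the zero point is stored as an n-bit integer, so clamp it.
        zero_point = min(max(round(-lo / scale), q_min), q_max)
    out = []
    for v in values:
        q = min(max(round(v / scale) + zero_point, q_min), q_max)
        out.append((q - zero_point) * scale)
    return out

def reconstruction_error(channels, n_bits, per_channel, symmetric):
    """Mean squared error between a tensor (list of channels) and its fake-quantized copy."""
    if per_channel:
        recon = [quantize_dequantize(ch, n_bits, symmetric) for ch in channels]
    else:
        flat = [v for ch in channels for v in ch]
        flat_q = quantize_dequantize(flat, n_bits, symmetric)
        recon, start = [], 0
        for ch in channels:
            recon.append(flat_q[start:start + len(ch)])
            start += len(ch)
    total = sum((a - b) ** 2
                for ch, rc in zip(channels, recon)
                for a, b in zip(ch, rc))
    return total / sum(len(ch) for ch in channels)

def select_config(channels, n_bits=8):
    """Return the (per_channel, symmetric) pair with the lowest MSE (stand-in metric)."""
    candidates = product((False, True), (False, True))
    return min(candidates,
               key=lambda c: reconstruction_error(channels, n_bits, *c))

# Two channels with very different ranges: a shared scale wastes resolution
# on the small channel, so channel-wise granularity reconstructs it better.
activations = [[0.0, 0.01, 0.02], [0.0, 4.0, 10.0]]
per_channel, symmetric = select_config(activations)
```

In this toy input the search picks channel-wise, asymmetric quantization. A layer whose channels share a similar range would see little benefit from channel-wise scales, which is why the choice is made per layer rather than fixed globally.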

Abstract

Recently, vision transformers (ViTs) have superseded convolutional neural networks in numerous applications, including classification, detection, and segmentation. However, the high computational requirements of ViTs hinder their widespread implementation. To address this issue, researchers have proposed efficient hybrid transformer architectures that combine convolutional and transformer layers with optimized attention computation of linear complexity. Additionally, post-training quantization has been proposed as a means of mitigating computational demands. For mobile devices, achieving optimal acceleration for ViTs necessitates the strategic integration of quantization techniques and efficient hybrid transformer structures. However, no prior investigation has applied quantization to efficient hybrid transformers. In this paper, we discover that applying existing post-training quantization (PTQ) methods for ViTs to efficient hybrid transformers leads to a drastic accuracy drop, attributed to the following four challenges: (i) highly dynamic ranges, (ii) zero-point overflow, (iii) diverse normalization, and (iv) limited model parameters (<5M). To overcome these challenges, we propose a new post-training quantization method, which is the first to quantize efficient hybrid ViTs (MobileViTv1, MobileViTv2, Mobile-Former, EfficientFormerV1, EfficientFormerV2). We achieve significant average improvements of 17.73% for 8-bit and 29.75% for 6-bit compared with existing PTQ methods (EasyQuant, FQ-ViT, PTQ4ViT, and RepQ-ViT). We plan to release our code at https://gitlab.com/ones-ai/q-hyvit.
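
To make challenge (ii) concrete, the sketch below shows one common way a zero point can overflow under unsigned asymmetric quantization: when a tensor's (or channel's) value range does not contain zero, the integer that should represent 0.0 falls outside the representable range. Whether this is exactly the mechanism analyzed in the paper is an assumption; the helper names `asymmetric_qparams` and `overflows` are hypothetical.

```python
def asymmetric_qparams(x_min, x_max, n_bits=8):
    """Scale and unclamped zero point for unsigned asymmetric quantization."""
    if x_max <= x_min:
        raise ValueError("x_max must be greater than x_min")
    q_max = 2 ** n_bits - 1
    scale = (x_max - x_min) / q_max
    zero_point = round(-x_min / scale)  # integer that maps back to real 0.0
    return scale, zero_point

def overflows(zero_point, n_bits=8):
    """True if the zero point cannot be stored as an unsigned n-bit integer."""
    return not (0 <= zero_point <= 2 ** n_bits - 1)

# Range that straddles zero: the zero point fits in [0, 255].
scale_a, zp_a = asymmetric_qparams(-1.0, 3.0)

# Narrow range far from zero: the zero point lands far below 0.
scale_b, zp_b = asymmetric_qparams(5.0, 6.0)

print(zp_a, overflows(zp_a))
print(zp_b, overflows(zp_b))
```

Channel-wise calibration makes this more likely than layer-wise calibration, because a single channel can have a narrow, offset range even when the layer as a whole spans zero.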

Paper Structure

This paper contains 32 sections, 9 equations, 9 figures, 5 tables, and 1 algorithm.

Figures (9)

  • Figure 1: Overall quantization process of Q-HyViT on the representative structure of hybrid vision transformers, including local, global, and bridge representation.
  • Figure 2: Discrepancy in activation ranges between the calibration and validation datasets in the first bridge block of MobileViTv2-100.
  • Figure 3: Per-channel activation ranges of the convolution in a bridge block of MobileViTv1-xxs.
  • Figure 4: Selected problematic activation channels of the convolution in a bridge block of MobileViTv1-xxs, caused by zero-point overflow when using channel-wise granularity with the asymmetric scheme.
  • Figure 5: Histogram of the overlap between quantized values (blue) and real values (orange) for six activation layers in the first, second, and third bridge blocks of the MobileViTv1-xxs model.
  • ...and 4 more figures