Q-HyViT: Post-Training Quantization of Hybrid Vision Transformers with Bridge Block Reconstruction for IoT Systems

Jemin Lee; Yongin Kwon; Sihyeong Park; Misun Yu; Jeman Park; Hwanjun Song

TL;DR

Q-HyViT addresses post-training quantization (PTQ) for efficient hybrid vision transformers by identifying four challenges specific to these models: highly dynamic activation ranges, zero-point overflow in bridge blocks, diverse normalizations, and models with fewer than 5M parameters. It introduces a Hessian-based hybrid reconstruction error minimization framework that automatically selects, per layer, the quantization granularity (channel-wise vs. layer-wise) and scheme (symmetric vs. asymmetric) for bridge and non-bridge components, with dedicated treatment of bridge blocks. Across multiple hybrid ViT architectures, the method improves on existing PTQ baselines by 17.73% at 8-bit and 29.75% at 6-bit on average, and by 43.63% on average over FQ-ViT in fully quantized scenarios. The work offers a practical, calibration-based path to deploying accurate, fully quantized hybrid ViTs on IoT devices, backed by detailed experimental validation. It also provides insight into designing quantization strategies for mixed CNN–Transformer structures and shows the importance of bridge-block-aware reconstruction for edge deployment.
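The selection of granularity and scheme described above can be illustrated with a small sketch. It is not the authors' implementation: it substitutes plain mean squared error for the Hessian-based reconstruction objective, works on nested Python lists rather than real activations, and the function names (`quantize_dequantize`, `reconstruction_error`, `select_config`) are hypothetical.

```python
from itertools import product


def quantize_dequantize(values, bits, symmetric):
    """Uniformly quantize a list of floats, then map it back to floats."""
    lo, hi = min(values), max(values)
    if symmetric:
        # Signed grid centred on zero; the zero point is fixed at 0.
        qmax = 2 ** (bits - 1) - 1
        qmin = -qmax
        max_abs = max(abs(lo), abs(hi))
        scale = max_abs / qmax if max_abs > 0 else 1.0
        zero_point = 0
    else:
        # Unsigned grid; the zero point shifts the grid onto [lo, hi].
        qmin, qmax = 0, 2 ** bits - 1
        scale = (hi - lo) / (qmax - qmin) if hi > lo else 1.0
        zero_point = round(-lo / scale)
    out = []
    for v in values:
        q = max(qmin, min(qmax, round(v / scale) + zero_point))
        out.append((q - zero_point) * scale)
    return out


def reconstruction_error(tensor, bits, symmetric, per_channel):
    """MSE between a tensor (a list of channels) and its quantized copy.

    Assumption: plain MSE stands in for the Hessian-based error.
    """
    if per_channel:
        recon = [quantize_dequantize(ch, bits, symmetric) for ch in tensor]
    else:
        # Layer-wise: one scale / zero point shared by all channels.
        flat = quantize_dequantize([v for ch in tensor for v in ch], bits, symmetric)
        recon, i = [], 0
        for ch in tensor:
            recon.append(flat[i:i + len(ch)])
            i += len(ch)
    n = sum(len(ch) for ch in tensor)
    return sum((a - b) ** 2
               for ch, rc in zip(tensor, recon)
               for a, b in zip(ch, rc)) / n


def select_config(tensor, bits=8):
    """Return the (symmetric, per_channel) pair with the lowest error."""
    candidates = product((True, False), (True, False))
    return min(candidates,
               key=lambda c: reconstruction_error(tensor, bits, c[0], c[1]))


# Two channels with very different ranges (made-up calibration values).
acts = [[0.0, 0.01, 0.02, 0.03], [10.0, 50.0, 90.0, 100.0]]
symmetric, per_channel = select_config(acts, bits=8)
```

`select_config` returns a `(symmetric, per_channel)` tuple for one tensor. A real pipeline would run such a search per layer on calibration data, with the reconstruction objective the paper defines.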

Abstract

Recently, vision transformers (ViTs) have superseded convolutional neural networks in numerous applications, including classification, detection, and segmentation. However, the high computational requirements of ViTs hinder their widespread deployment. To address this issue, researchers have proposed efficient hybrid transformer architectures that combine convolutional and transformer layers and compute attention with linear complexity. Post-training quantization (PTQ) has also been proposed as a means of reducing computational demands. On mobile devices, achieving the best acceleration for ViTs requires combining quantization techniques with efficient hybrid transformer structures. However, no prior work has applied quantization to efficient hybrid transformers. In this paper, we find that applying existing PTQ methods for ViTs to efficient hybrid transformers leads to a drastic accuracy drop, which we attribute to the following four challenges: (i) highly dynamic ranges, (ii) zero-point overflow, (iii) diverse normalization, and (iv) limited model parameters (<5M). To overcome these challenges, we propose a new post-training quantization method, which is the first to quantize efficient hybrid ViTs (MobileViTv1, MobileViTv2, Mobile-Former, EfficientFormerV1, EfficientFormerV2). On average, we achieve improvements of 17.73% for 8-bit and 29.75% for 6-bit over existing PTQ methods (EasyQuant, FQ-ViT, PTQ4ViT, and RepQ-ViT). We plan to release our code at https://gitlab.com/ones-ai/q-hyvit.
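To make challenge (ii) concrete, the sketch below shows one way a zero point can fall outside the integer range under asymmetric quantization: when a channel's values lie in a narrow band far from zero. This is standard quantization arithmetic rather than the paper's own analysis, which this summary does not spell out; the helper names and the example ranges are hypothetical.

```python
def asymmetric_params(lo, hi, bits=8):
    """Return (scale, zero_point) mapping [lo, hi] onto the unsigned grid [0, 2^bits - 1]."""
    if hi <= lo:
        raise ValueError("expected hi > lo")
    qmax = 2 ** bits - 1
    scale = (hi - lo) / qmax
    # The zero point is the integer that represents the real value 0.0.
    zero_point = round(-lo / scale)
    return scale, zero_point


def zero_point_overflows(lo, hi, bits=8):
    """True if the zero point cannot itself be stored as an unsigned `bits`-bit integer."""
    _, zero_point = asymmetric_params(lo, hi, bits)
    return not (0 <= zero_point <= 2 ** bits - 1)


# Hypothetical channel whose activations all lie in [4.0, 6.0]:
# the real value 0.0 is far outside the range, so the zero point
# lands well below 0 on the 8-bit grid.
narrow_offset_channel = zero_point_overflows(4.0, 6.0)

# Hypothetical channel spanning [-1.0, 3.0]: the range contains 0.0,
# so the zero point fits on the grid.
zero_spanning_channel = zero_point_overflows(-1.0, 3.0)
```

Channel-wise quantization makes such cases more likely than layer-wise quantization, because each channel's range is considered in isolation and need not contain zero.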

Paper Structure (32 sections, 9 equations, 9 figures, 5 tables, 1 algorithm)

Introduction
Related Work
Efficient Computer Vision Architecture
Model Quantization
Preliminary
Hybrid Vision Transformers and Bridge Blocks
Variants of Hybrid Vision Transformer
Bridge Blocks
Hybrid Vision Transformer Quantization
Challenges of Hybrid ViT Quantization
C1: Highly Dynamic Activation Range
C2: Zero-point Overflow in Bridge Block
C3: Quantization with Diverse Normalizations
C4: Sub-5M Parameter Models
Methodology
...and 17 more sections

Figures (9)

Figure 1: Overall quantization process of Q-HyViT on the representative structure of hybrid vision transformers, including local, global, and bridge representation.
Figure 2: Discrepancy in activation ranges between the calibration and validation datasets in the 1st bridge block of MobileViTv2-100.
Figure 3: Per-channel activation ranges of the convolution in the bridge block of MobileViTv1-xxs.
Figure 4: Selected problematic activation channels of the convolution in the bridge block of MobileViTv1-xxs, caused by zero-point overflow under channel-wise, asymmetric quantization.
Figure 5: Histogram of the overlap between quantized values (blue) and real values (orange) for six activation layers in the 1st, 2nd, and 3rd bridge blocks of the MobileViTv1-xxs model.
...and 4 more figures
