Post-Training Quantization for 3D Medical Image Segmentation: A Practical Study on Real Inference Engines
Chongyu Qu, Ritchie Zhao, Ye Yu, Bin Liu, Tianyuan Yao, Junchao Zhu, Bennett A. Landman, Yucheng Tang, Yuankai Huo
TL;DR
This work tackles the gap between the theoretical benefits of quantization and real-world deployment by proposing a real INT8 post-training quantization (PTQ) framework for 3D medical image segmentation. The method first applies fake quantization on ONNX to mimic INT8 behavior and then converts the model to a real INT8 TensorRT engine, enabling tangible reductions in model size and inference latency without retraining. Experiments on seven state-of-the-art 3D segmentation models across BTCV, Whole Brain, and TotalSegmentator V2 demonstrate 2.42–3.85x size reductions and 2.05–2.66x speedups while preserving mDSC. The approach handles large-scale architectures (including STU-Net-H) and shows strong generalization across architectures and datasets, suggesting practical utility for resource-constrained clinical deployments. Limitations include TensorRT compatibility with dynamic model components and the potential for even lower-bit quantization (e.g., INT4) in future work.
Abstract
Quantizing deep neural networks, that is, reducing the precision (bit-width) of their computations, can remarkably decrease memory usage and accelerate processing, making these models more suitable for large-scale medical imaging applications with limited computational resources. However, many existing methods study "fake quantization", which simulates lower-precision operations during inference but does not actually reduce model size or improve real-world inference speed. Moreover, the potential of deploying real 3D low-bit quantization on modern GPUs is still unexplored. In this study, we introduce a real post-training quantization (PTQ) framework that successfully implements true 8-bit quantization on state-of-the-art (SOTA) 3D medical segmentation models, i.e., U-Net, SegResNet, SwinUNETR, nnU-Net, UNesT, TransUNet, ST-UNet, and VISTA3D. Our approach involves two main steps. First, we use TensorRT to perform fake quantization for both weights and activations with an unlabeled calibration dataset. Second, we convert this fake quantization into real quantization via a TensorRT engine on real GPUs, resulting in real-world reductions in model size and inference latency. Extensive experiments demonstrate that our framework effectively performs 8-bit quantization on GPUs without sacrificing model performance. This advancement enables the deployment of efficient deep learning models in medical imaging applications where computational resources are constrained. The code and models have been released, including U-Net and TransUNet pretrained on the BTCV dataset for abdominal (13-label) segmentation, UNesT pretrained on the Whole Brain Dataset for whole brain (133-label) segmentation, and nnU-Net, SegResNet, SwinUNETR, and VISTA3D pretrained on TotalSegmentator V2 for full body (104-label) segmentation. https://github.com/hrlblab/PTQ.
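The distinction between fake and real quantization drawn above can be illustrated with a minimal NumPy sketch. It is not the released framework and does not use TensorRT or ONNX. It assumes symmetric per-tensor INT8 quantization with max calibration and an FP32 baseline, and the helper names (`calibrate_scale`, `quantize_int8`, `dequantize`, `fake_quantize`) are hypothetical. The point it shows is that fake quantization rounds values onto the INT8 grid but still stores them as 32-bit floats, whereas real quantization stores 8-bit integers plus a scale.

```python
import numpy as np


def calibrate_scale(samples: np.ndarray) -> float:
    """Max calibration (an assumed scheme): map the largest magnitude to 127."""
    amax = float(np.max(np.abs(samples)))
    return amax / 127.0 if amax > 0 else 1.0


def quantize_int8(x: np.ndarray, scale: float) -> np.ndarray:
    """Real quantization: round to the nearest step and store as int8."""
    q = np.clip(np.round(x / scale), -127, 127)
    return q.astype(np.int8)


def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map int8 codes back to approximate float32 values."""
    return q.astype(np.float32) * scale


def fake_quantize(x: np.ndarray, scale: float) -> np.ndarray:
    """Fake quantization: quantize then dequantize, so storage stays float32."""
    return dequantize(quantize_int8(x, scale), scale)


# Stand-in for a layer's weights; calibration here needs no labels.
rng = np.random.default_rng(0)
weights = rng.standard_normal((64, 64)).astype(np.float32)

scale = calibrate_scale(weights)
w_fake = fake_quantize(weights, scale)  # INT8 behavior, float32 storage
w_real = quantize_int8(weights, scale)  # INT8 behavior, int8 storage

max_err = float(np.max(np.abs(weights - w_fake)))

print("fake:", w_fake.dtype, w_fake.nbytes, "bytes")
print("real:", w_real.dtype, w_real.nbytes, "bytes")
print("max rounding error:", max_err, "step size:", scale)
```

In this toy setting the real INT8 tensor takes a quarter of the bytes of the fake-quantized one, while both carry the same rounding error of at most half a quantization step. A deployed engine would typically compress less than this per-tensor ratio, for example if some layers or metadata are kept at higher precision.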
