Post-Training Quantization for 3D Medical Image Segmentation: A Practical Study on Real Inference Engines

Chongyu Qu, Ritchie Zhao, Ye Yu, Bin Liu, Tianyuan Yao, Junchao Zhu, Bennett A. Landman, Yucheng Tang, Yuankai Huo

TL;DR

This work tackles the gap between the theoretical benefits of quantization and real-world deployment by proposing a real INT8 post-training quantization (PTQ) framework for 3D medical image segmentation. The method first applies fake quantization on ONNX to mimic INT8 behavior and then converts the model to a real INT8 TensorRT engine, enabling tangible reductions in model size and inference latency without retraining. Experiments on seven state-of-the-art 3D segmentation models across BTCV, Whole Brain, and TotalSegmentator V2 demonstrate 2.42–3.85x size reductions and 2.05–2.66x speedups while preserving mDSC. The approach handles large-scale architectures (including STU-Net-H) and shows strong generalization across architectures and datasets, suggesting practical utility for resource-constrained clinical deployments. Limitations include TensorRT compatibility with dynamic model components and the potential for even lower-bit quantization (e.g., INT4) in future work.

Abstract

Quantizing deep neural networks, i.e., reducing the precision (bit-width) of their computations, can remarkably decrease memory usage and accelerate processing, making these models more suitable for large-scale medical imaging applications with limited computational resources. However, many existing methods study "fake quantization", which simulates lower-precision operations during inference but does not actually reduce model size or improve real-world inference speed. Moreover, the potential of deploying real 3D low-bit quantization on modern GPUs is still unexplored. In this study, we introduce a real post-training quantization (PTQ) framework that successfully implements true 8-bit quantization on state-of-the-art (SOTA) 3D medical segmentation models, i.e., U-Net, SegResNet, SwinUNETR, nnU-Net, UNesT, TransUNet, ST-UNet, and VISTA3D. Our approach involves two main steps. First, we use TensorRT to perform fake quantization for both weights and activations with an unlabeled calibration dataset. Second, we convert this fake quantization into real quantization via a TensorRT engine on real GPUs, resulting in real-world reductions in model size and inference latency. Extensive experiments demonstrate that our framework effectively performs 8-bit quantization on GPUs without sacrificing model performance. This advancement enables the deployment of efficient deep learning models in medical imaging applications where computational resources are constrained. The code and models have been released, including U-Net and TransUNet pretrained on the BTCV dataset for abdominal (13-label) segmentation, UNesT pretrained on the Whole Brain Dataset for whole brain (133-label) segmentation, and nnU-Net, SegResNet, SwinUNETR, and VISTA3D pretrained on TotalSegmentator V2 for full body (104-label) segmentation. https://github.com/hrlblab/PTQ.
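
The distinction between fake and real quantization drawn in the abstract can be illustrated with a small, standard-library-only sketch. It is not the authors' TensorRT pipeline: the symmetric max-abs calibration, the helper names (`calibrate_scale`, `quantize`, `dequantize`), and the toy weight values are all assumptions chosen for illustration. The point is only that a quantize-then-dequantize round trip reproduces INT8 rounding error while still storing 4-byte floats, whereas real quantization stores 1-byte integers.

```python
from array import array

QMAX = 127  # largest magnitude used for symmetric signed INT8


def calibrate_scale(samples):
    """Max-abs calibration (assumed scheme): map the largest magnitude to QMAX."""
    max_abs = max(abs(x) for x in samples)
    return max_abs / QMAX if max_abs > 0 else 1.0


def quantize(values, scale):
    """Real quantization: round to the nearest step and store as 1-byte integers."""
    return array("b", (max(-QMAX, min(QMAX, round(v / scale))) for v in values))


def dequantize(q_values, scale):
    """Fake quantization output: INT8-rounded values stored back as 4-byte floats."""
    return array("f", (q * scale for q in q_values))


weights = array("f", [0.50, -1.27, 0.031, 0.90, -0.25])  # toy FP32 tensor (hypothetical)
scale = calibrate_scale(weights)

real_q = quantize(weights, scale)   # what an INT8 engine would store
fake_q = dequantize(real_q, scale)  # what a simulated (fake-quantized) model computes with

real_bytes = len(real_q) * real_q.itemsize
fake_bytes = len(fake_q) * fake_q.itemsize
max_error = max(abs(w - f) for w, f in zip(weights, fake_q))

print(f"real INT8 storage: {real_bytes} bytes, fake-quant storage: {fake_bytes} bytes")
print(f"max round-trip error: {max_error:.6f} (half a step is {scale / 2:.6f})")
```

For this five-element toy tensor the integer copy occupies 5 bytes and the fake-quantized copy 20 bytes, even though both carry the same rounded values. This matches the abstract's observation that simulation alone does not shrink the model.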

Paper Structure

This paper contains 12 sections, 6 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: (a) PyTorch Models with FP32 Precision. Prior work on 3D medical image segmentation commonly uses FP32 models, which results in larger model sizes, higher computational demands, and slower inference. As medical datasets continue to grow, improving model efficiency becomes increasingly important. (b) TensorRT Engine with INT8 Precision. We propose a real PTQ framework using NVIDIA TensorRT to convert FP32 models into INT8, enabling notable reductions in both model size and inference latency without compromising performance. For example, U-Net's model size shrinks from 23.11 MB to 6.61 MB, and its inference latency drops from 2.62 ms to 1.05 ms, while maintaining the same mean Dice Score (mDSC) of 0.822. (c) Inference Latency vs. Model Size. We evaluate seven medical segmentation models, i.e., U-Net, SegResNet, SwinUNETR, nnU-Net, UNesT, TransUNet, and VISTA3D, before and after our PTQ framework. Compared with their original FP32 versions (orange), our INT8 models (green) achieve clearly smaller model sizes and lower inference latency, indicating superior efficiency.
  • Figure 2: Comparison between real quantization and fake quantization. We compare the real quantization and fake quantization on seven medical segmentation models, i.e., VISTA3D, SegResNet, SwinUNETR, nnU-Net, UNesT, TransUNet and U-Net across three datasets with varying sample sizes (N) and label counts (C), i.e., TotalSegmentator V2 ($N=200, C=104$), Whole Brain ($N=50, C=133$) and BTCV ($N=20, C=13$). The left panel compares model sizes for INT8 (real quant), INT8 (fake quant), and the original FP32 models, while the right panel compares their inference latencies. As shown, fake quantization only simulates low-precision computation and provides no real-world reduction in model size or latency. In contrast, our real quantization reduces model size by a factor of $2.42\times$ to $3.85\times$ and speeds up inference by $2.05\times$ to $2.66\times$.
  • Figure 3: Post-training quantization framework. We first convert the original PyTorch model into the ONNX format. Next, we simulate quantization by adding QuantizeLinear and DequantizeLinear nodes into the ONNX model using a calibration dataset to create a fake quantized model; this step simulates the INT8 quantization process but still relies on FP32 resources. Finally, we convert this fake quantized model into a real INT8 quantized engine using NVIDIA TensorRT. During this conversion, TensorRT detects the QuantizeLinear and DequantizeLinear nodes to perform actual INT8 quantization, and ReLU layers are fused into preceding layers for performance optimization.
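
As a quick sanity check on the numbers quoted in the captions above, the U-Net figures from Figure 1 can be turned into ratios and compared against the ranges reported in Figure 2. The sketch below uses only values stated in the captions; the variable names and the `within` helper are illustrative.

```python
# Values quoted in the Figure 1 caption for U-Net (FP32 vs. INT8).
fp32_size_mb, int8_size_mb = 23.11, 6.61
fp32_latency_ms, int8_latency_ms = 2.62, 1.05

size_reduction = fp32_size_mb / int8_size_mb
speedup = fp32_latency_ms / int8_latency_ms

# Ranges quoted in the Figure 2 caption across the evaluated models.
SIZE_RANGE = (2.42, 3.85)
SPEEDUP_RANGE = (2.05, 2.66)


def within(value, bounds):
    """Return True if value lies inside the closed interval given by bounds."""
    low, high = bounds
    return low <= value <= high


print(f"size reduction: {size_reduction:.2f}x, in range: {within(size_reduction, SIZE_RANGE)}")
print(f"speedup: {speedup:.2f}x, in range: {within(speedup, SPEEDUP_RANGE)}")
```

The U-Net example works out to roughly a 3.50x size reduction and a 2.50x speedup, both inside the reported ranges.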