
PTQ4SAM: Post-Training Quantization for Segment Anything

Chengtao Lv, Hong Chen, Jinyang Guo, Yifu Ding, Xianglong Liu

TL;DR

PTQ4SAM tackles the practical deployment challenge of Segment Anything Model (SAM) by introducing a tailored post-training quantization framework. It identifies two SAM-specific bottlenecks: bimodal distributions in post-Key-Linear activations and highly heterogeneous post-Softmax distributions across attention types, and addresses them with Bimodal Integration (BIG) and Adaptive Granularity Quantization (AGQ). BIG uses a per-channel sign-driven transformation to convert bimodal activations into a normal distribution, while AGQ searches for a hardware-friendly Softmax base to balance granularity across attention scores. The method is plug-and-play for both statistic-based and learning-based PTQ pipelines, delivering substantial FLOPs and storage savings with minimal or zero loss in accuracy, including lossless performance at 6-bit for SAM-L on instance segmentation and strong gains across instance/semantic segmentation and object detection. Overall, PTQ4SAM enables efficient SAM inference on resource-constrained devices and highlights SAM-specific quantization strategies as a practical path forward.

Abstract

Segment Anything Model (SAM) has achieved impressive performance in many computer vision tasks. However, as a large-scale model, the immense memory and computation costs hinder its practical deployment. In this paper, we propose a post-training quantization (PTQ) framework for Segment Anything Model, namely PTQ4SAM. First, we investigate the inherent bottleneck of SAM quantization attributed to the bimodal distribution in post-Key-Linear activations. We analyze its characteristics from both per-tensor and per-channel perspectives, and propose a Bimodal Integration strategy, which utilizes a mathematically equivalent sign operation to transform the bimodal distribution into a relatively easy-quantized normal distribution offline. Second, SAM encompasses diverse attention mechanisms (i.e., self-attention and two-way cross-attention), resulting in substantial variations in the post-Softmax distributions. Therefore, we introduce an Adaptive Granularity Quantization for Softmax through searching the optimal power-of-two base, which is hardware-friendly. Extensive experimental results across various vision tasks (instance segmentation, semantic segmentation and object detection), datasets and model variants show the superiority of PTQ4SAM. For example, when quantizing SAM-L to 6-bit, we achieve lossless accuracy for instance segmentation, about 0.5% drop with theoretical 3.9$\times$ acceleration. The code is available at https://github.com/chengtao-lv/PTQ4SAM.
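
The search over a power-of-two Softmax base mentioned in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes a logarithmic quantizer of the form clip(round(-tau * log2(x)), 0, 2^b - 1), a candidate set tau in {1, 2, 4}, and a plain mean-squared-error objective on the attention scores; the function names `log_quantize`, `quant_mse` and `search_tau` are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of attention logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def log_quantize(p, bits, tau):
    """Quantize values in (0, 1] on a log grid with base 2**(1/tau), then dequantize.

    Assumed form: q = clip(round(-tau * log2(x)), 0, 2**bits - 1), x_hat = 2**(-q / tau).
    A larger tau gives a finer grid near 1 but covers a narrower range of small values.
    """
    levels = 2 ** bits - 1
    out = []
    for x in p:
        if x <= 0.0:
            q = levels  # zero maps to the smallest representable value
        else:
            q = min(max(round(-tau * math.log2(x)), 0), levels)
        out.append(2.0 ** (-q / tau))
    return out

def quant_mse(rows, bits, tau):
    """Mean squared error between attention scores and their quantized version."""
    err, count = 0.0, 0
    for p in rows:
        for x, xh in zip(p, log_quantize(p, bits, tau)):
            err += (x - xh) ** 2
            count += 1
    return err / count

def search_tau(rows, bits, candidates=(1, 2, 4)):
    """Pick the candidate tau with the lowest error (MSE objective is an assumption)."""
    best = min(candidates, key=lambda t: quant_mse(rows, bits, t))
    return best, quant_mse(rows, bits, best)

# Two toy attention maps: one sharply peaked, one nearly uniform.
peaked = [softmax([8.0, 0.0, 0.0, 0.0]), softmax([0.0, 9.0, 1.0, 0.0])]
flat = [softmax([0.1, 0.0, 0.2, 0.1]), softmax([0.0, 0.1, 0.0, 0.2])]
tau_peaked, err_peaked = search_tau(peaked, bits=4)
tau_flat, err_flat = search_tau(flat, bits=4)
```

Restricting tau to powers of two is what keeps the quantizer hardware-friendly in this sketch: the dequantization reduces to shifts plus a small lookup. Running the search separately for each attention type lets each one settle on its own granularity.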

Paper Structure (24 sections, 10 equations, 9 figures, 6 tables, 1 algorithm)


Figures (9)

  • Figure 1: The histogram of two special distributions in SAM: (a) bimodal distribution in post-Key-Linear activations. (b) post-Softmax distributions of self-attention, token-to-image cross-attention and image-to-token cross-attention.
  • Figure 2: Illustration of our proposed PTQ4SAM. The Bimodal Integration eliminates the bimodal distribution by simultaneously multiplying a channel-wise $\boldsymbol{\gamma}$ to both the query and key linears. The Adaptive Granularity Quantization is employed for post-softmax distribution.
  • Figure 3: Boxplot of different channels of post-Key-Linear activations in SAM.
  • Figure 4: (a) Theoretical acceleration rate (100 prompts) vs. all SAM models. (b) Accuracy vs. storage.
  • Figure 5: Visualization of instance segmentation on 4-bit SAM-L.
  • ...and 4 more figures
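
The channel-wise $\boldsymbol{\gamma}$ multiplication described in the Figure 2 caption can be checked numerically with a minimal sketch. It is not the authors' code. It assumes that $\boldsymbol{\gamma}$ holds one sign (+1 or -1) per channel, chosen here as the sign of the channel mean of the key activations, and that it is folded into the weights and biases of the query and key linear layers offline. All names (`linear`, `fold`, `sign_vector`) are hypothetical, and plain Python lists are used so the example has no dependencies.

```python
import random

def linear(x, w, b):
    """Compute x @ w + b with plain lists: x is [n][d_in], w is [d_in][d], b is [d]."""
    return [[sum(row[i] * w[i][j] for i in range(len(w))) + b[j]
             for j in range(len(b))] for row in x]

def logits(q, k):
    """Attention logits q @ k^T (the usual scaling factor is omitted)."""
    return [[sum(a * c for a, c in zip(qi, kj)) for kj in k] for qi in q]

def channel_means(k):
    """Mean of each channel (column) over all tokens."""
    return [sum(row[c] for row in k) / len(k) for c in range(len(k[0]))]

def sign_vector(k):
    """gamma_c = +1 if the channel mean is non-negative, else -1 (assumed rule)."""
    return [1.0 if m >= 0 else -1.0 for m in channel_means(k)]

def fold(w, b, gamma):
    """Scale output channel c of a linear layer by gamma_c; done once, offline."""
    w2 = [[w[i][j] * gamma[j] for j in range(len(gamma))] for i in range(len(w))]
    b2 = [b[j] * gamma[j] for j in range(len(gamma))]
    return w2, b2

rng = random.Random(0)
n, d_in, d = 8, 4, 6
x = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(n)]
wq = [[rng.gauss(0, 0.1) for _ in range(d)] for _ in range(d_in)]
wk = [[rng.gauss(0, 0.1) for _ in range(d)] for _ in range(d_in)]
bq = [rng.gauss(0, 0.1) for _ in range(d)]
# Alternating large biases place channels around +5 or -5, so the
# per-tensor histogram of K has two peaks (a toy bimodal case).
bk = [5.0 if c % 2 == 0 else -5.0 for c in range(d)]

q, k = linear(x, wq, bq), linear(x, wk, bk)
gamma = sign_vector(k)

wq2, bq2 = fold(wq, bq, gamma)
wk2, bk2 = fold(wk, bk, gamma)
q2, k2 = linear(x, wq2, bq2), linear(x, wk2, bk2)

# gamma_c * gamma_c == 1, so every term of q . k is unchanged.
ref, new = logits(q, k), logits(q2, k2)
max_diff = max(abs(a - b) for ra, rb in zip(ref, new) for a, b in zip(ra, rb))
folded_means = channel_means(k2)
```

In this toy setup `max_diff` stays at floating-point noise level, while every entry of `folded_means` is non-negative: the two peaks of the key activations have been merged into one without changing the attention logits.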