SAQ-SAM: Semantically-Aligned Quantization for Segment Anything Model

Jing Zhang, Zhikai Li, Chengzhi Hu, Xuewen Liu, Qingyi Gu

TL;DR

The paper tackles the challenge of deploying Segment Anything Model (SAM) on resource-constrained devices by addressing two PTQ weaknesses: outlier-prone mask-decoder activations and misalignment between visual features and prompts. It presents SAQ-SAM, combining Perceptual-Consistency Clipping (PCC) to preserve semantic attention while aggressively clipping extreme activations, and Prompt-Aware Reconstruction (PAR) to align image-prompt interactions during reconstruction; a layer-skipping strategy further improves efficiency. Across instance segmentation, oriented object detection, and semantic segmentation, SAQ-SAM demonstrates strong gains in low-bit regimes, notably achieving an 11.7% higher mAP for 4-bit SAM-B on COCO instance segmentation. These results enable more practical edge deployment of SAM without substantial loss of performance, while preserving prompt-driven capabilities.

Abstract

Segment Anything Model (SAM) exhibits remarkable zero-shot segmentation capability; however, its prohibitive computational costs make edge deployment challenging. Although post-training quantization (PTQ) offers a promising compression solution, existing methods yield unsatisfactory results when applied to SAM, owing to its specialized model components and promptable workflow: (i) The mask decoder's attention exhibits extreme activation outliers, and we find that aggressive clipping (even 100x), without smoothing or isolation, is effective in suppressing outliers while maintaining performance. Unfortunately, traditional distribution-based metrics (e.g., MSE) fail to provide such large-scale clipping. (ii) Existing quantization reconstruction methods neglect the semantic interactivity of SAM, leading to misalignment between image features and prompt intention. To address these issues, we propose SAQ-SAM, which boosts PTQ for SAM from the perspective of semantic alignment. Specifically, we propose Perceptual-Consistency Clipping, which exploits attention focus overlap to promote aggressive clipping while preserving semantic capabilities. Furthermore, we propose Prompt-Aware Reconstruction, which incorporates image-prompt interactions by leveraging cross-attention in the mask decoder, thus facilitating alignment in both distribution and semantics. Moreover, to keep the interaction efficient, we design a layer-skipping strategy for image tokens in the encoder. Extensive experiments are conducted on various SAM sizes and tasks, including instance segmentation, oriented object detection, and semantic segmentation, and the results show that our method consistently exhibits advantages. For example, when quantizing SAM-B to 4-bit, SAQ-SAM achieves 11.7% higher mAP than the baseline in the instance segmentation task.
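To make the contrast between distribution-based and attention-based clipping concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that "attention focus" means the top fraction of attention weights per query, and it quantizes only the key activations for brevity. The function names (`fake_quantize`, `focus_overlap`, `search_clip`) and the parameters `top_frac` and `n_candidates` are hypothetical and chosen for illustration only.

```python
import numpy as np

def fake_quantize(x, clip, bits=4):
    """Symmetric uniform quantize-dequantize; values beyond +/-clip saturate."""
    levels = 2 ** (bits - 1) - 1
    scale = clip / levels
    return np.clip(np.round(x / scale), -levels, levels) * scale

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k):
    """Scaled dot-product attention weights, shape (num_queries, num_keys)."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1]))

def focus_mask(attn, top_frac=0.1):
    """Assumed definition of 'attention focus': top fraction of weights per query."""
    n_top = max(1, int(top_frac * attn.shape[-1]))
    thresh = np.sort(attn, axis=-1)[:, -n_top][:, None]
    return attn >= thresh

def focus_overlap(attn_fp, attn_q, top_frac=0.1):
    """Intersection-over-union of the full-precision and quantized focus masks."""
    a, b = focus_mask(attn_fp, top_frac), focus_mask(attn_q, top_frac)
    return (a & b).sum() / (a | b).sum()

def search_clip(q, k, bits=4, metric="overlap", n_candidates=40):
    """Grid-search a clipping threshold for k under the chosen metric."""
    attn_fp = attention(q, k)
    k_max = np.abs(k).max()
    best_clip, best_score = None, -np.inf
    for ratio in np.logspace(-3, 0, n_candidates):
        clip = ratio * k_max
        k_q = fake_quantize(k, clip, bits)
        if metric == "mse":
            score = -np.mean((k - k_q) ** 2)  # distribution-based criterion
        else:
            score = focus_overlap(attn_fp, attention(q, k_q))  # semantic criterion
        if score > best_score:
            best_clip, best_score = clip, score
    return best_clip

# Toy data: Gaussian activations plus one extreme outlier in the keys.
rng = np.random.default_rng(0)
q = rng.normal(size=(8, 16))
k = rng.normal(size=(64, 16))
k[0, 0] = 200.0

for metric in ("mse", "overlap"):
    clip = search_clip(q, k, metric=metric)
    iou = focus_overlap(attention(q, k), attention(q, fake_quantize(k, clip)))
    print(f"{metric}: clip={clip:.3f}, focus overlap={iou:.3f}")
```

Because both searches scan the same candidate thresholds, the overlap-based choice can never score a lower focus overlap than the MSE-based one in this sketch. How far apart the two thresholds end up depends on the data.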

Paper Structure

This paper contains 21 sections, 12 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: Visualization of extreme activation distributions in the mask decoder and the performance of different clipping methods. QK activations in the mask decoder show highly skewed distributions, with most data concentrated in a narrow range while outliers can exceed 180 times the normal range. MSE provides an overly wide clipping range, whereas our Perceptual-Consistency Clipping (PCC) method can identify outliers more precisely.
  • Figure 2: Comparison of total inference time between image encoder and mask decoder in semantic segmentation task.
  • Figure 3: Attention heatmaps in the mask decoder with different quantization clipping methods. The distribution-aligned MSE leads to significant attention degradation, whereas our semantic-aligned PCC maintains the consistency with the FP model.
  • Figure 4: Overview of SAQ-SAM. The proposed PCC guides quantization clipping of QK activations by minimizing the Attention Focus deviation from FP, thereby semantically preserving the perceptual alignment. Our PAR incorporates image-prompt interactions into per-stage reconstruction, utilizing the off-the-shelf module in the mask decoder. By minimizing the interaction response error supervised by the FP model, the quantized model learns the correspondence between visual features and prompt intentions, thus facilitating dual alignment at both the distributional and semantic levels.
  • Figure 5: Segmentation results with image tokens from different stages. The output features of each stage can skip subsequent propagation while still yielding competent segmentation, with quality improving at deeper stages.
  • ...and 2 more figures
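
The Figure 4 caption describes reconstruction supervised by an interaction response error in addition to distribution alignment. A minimal sketch of such a combined objective is given below. It assumes a single-head, projection-free cross-attention as a stand-in for the mask decoder module, and a simple weighted sum of the two terms. The names `cross_attention_response` and `prompt_aware_loss` and the weight `lam` are hypothetical and not taken from the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention_response(prompt_tokens, image_tokens):
    """Prompt-to-image cross-attention output (assumed stand-in for the decoder module)."""
    d = prompt_tokens.shape[-1]
    attn = softmax(prompt_tokens @ image_tokens.T / np.sqrt(d))
    return attn @ image_tokens

def prompt_aware_loss(feat_fp, feat_q, prompt_tokens, lam=1.0):
    """Distribution term (feature MSE) plus interaction term (response MSE)."""
    dist_err = np.mean((feat_fp - feat_q) ** 2)
    resp_fp = cross_attention_response(prompt_tokens, feat_fp)
    resp_q = cross_attention_response(prompt_tokens, feat_q)
    inter_err = np.mean((resp_fp - resp_q) ** 2)
    return dist_err + lam * inter_err

# Toy example: 64 image tokens and 4 prompt tokens of width 16.
rng = np.random.default_rng(0)
feat_fp = rng.normal(size=(64, 16))
feat_q = feat_fp + 0.05 * rng.normal(size=feat_fp.shape)  # simulated quantization noise
prompts = rng.normal(size=(4, 16))
print(prompt_aware_loss(feat_fp, feat_q, prompts))
```

In an actual reconstruction loop this scalar would be minimized with respect to the quantization parameters of the stage being calibrated; the sketch only evaluates the objective.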