Q-SAM2: Accurate Quantization for Segment Anything Model 2

Nicola Farronato, Florian Scheidegger, Mattia Rigotti, Cristiano Malossi, Michele Magno, Haotong Qin

TL;DR

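Q-SAM2 is an accurate low-bit quantization method for SAM2 that achieves high compression and high fidelity through two contributions: Variance-Reduced Calibration (VRC), an initialization method that reduces weight statistical variance by minimizing the Frobenius norm over a small calibration batch; and Learnable Statistical Clipping (LSC), a Quantization-Aware Training (QAT) method that learns momentum-stabilized clipping factors to manage outliers in weights and activations.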

Abstract

The Segment Anything Model 2 (SAM2) is a powerful foundation model for promptable segmentation. However, its high computational and memory costs are a major barrier to deployment on resource-constrained devices. In this paper, we present Q-SAM2, an accurate low-bit quantization method that achieves high compression and high fidelity. To address performance degradation arising from challenging weight and activation distributions during quantization, Q-SAM2 introduces two novel contributions: Variance-Reduced Calibration (VRC), an initialization method that reduces weight statistical variance by minimizing the Frobenius norm over a small calibration batch; and Learnable Statistical Clipping (LSC), a Quantization-Aware Training (QAT) method that learns momentum-stabilized clipping factors to manage outliers in weights and activations. Comprehensive experiments demonstrate that Q-SAM2 achieves highly accurate inference with substantial efficiency gains, significantly surpassing state-of-the-art general QAT schemes, particularly in the ultra-low 2-bit regime. Specifically, Q-SAM2 achieves an accuracy gain of up to 9.7 ppt in J&F on the video segmentation benchmark and 7.3 ppt in mIoU for instance segmentation over the best competing QAT model, all while achieving an 8x reduction in model size compared to the BF16 baseline.
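
As a rough illustration of the quantities mentioned in the abstract, the sketch below shows a signed uniform fake quantizer with a clipping threshold, plus the bit-width arithmetic behind the 8x figure. It is not the authors' implementation: the function names, the rule "threshold = factor × standard deviation", and the fixed (non-learned) factor are assumptions made for illustration. LSC as described learns its clipping factors during QAT and stabilizes them with momentum, neither of which is modelled here.

```python
import statistics


def fake_quantize(values, bits, clip):
    """Clip to a signed uniform grid with 2**bits levels, then dequantize."""
    qmin = -(2 ** (bits - 1))
    qmax = 2 ** (bits - 1) - 1
    scale = clip / (2 ** (bits - 1))  # step between adjacent grid points
    out = []
    for v in values:
        q = max(qmin, min(qmax, round(v / scale)))  # round, then saturate
        out.append(q * scale)
    return out


def statistical_clip(values, factor):
    """Hypothetical clipping rule (assumption): factor * population std-dev."""
    return factor * statistics.pstdev(values)


def compression_ratio(baseline_bits, quantized_bits):
    """Weight-storage ratio, ignoring scales and any unquantized parameters."""
    return baseline_bits / quantized_bits


weights = [0.0, 0.4, -3.0, 10.0]
print(fake_quantize(weights, bits=2, clip=1.0))
print(compression_ratio(16, 2))
```

With `bits=2` and `clip=1.0` the grid is {-1.0, -0.5, 0.0, 0.5}, so outliers such as 10.0 saturate at the grid edge instead of stretching the range. The ratio 16 / 2 = 8 counts weight bits only, which matches the 8x reduction stated relative to BF16.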

Paper Structure

This paper contains 27 sections, 7 equations, 17 figures, 7 tables.

Figures (17)

  • Figure 1: Comparison between Q-SAM2 and Post-Training Quantization (PTQ) methods based on SAM [kirillov2023segany] (ViT-B/L/H). Q-SAM2 defines a new SOTA Pareto frontier; even at its lowest precision (W2A2), it maintains accuracy comparable to higher-precision SOTA methods (PTQ4SAM [PTQ4SAMLv], BRECQ [li2021brecq], QDROP [wei2023qdroprandomlydroppingquantization]).
  • Figure 2: The Q-SAM2 approach. The weight distributions of the linear layers in the image encoder are calibrated using the VRC to reduce variance. We substitute the original encoder and train the network using the LSC in a QAT pipeline.
  • Figure 3: Impact of VRC on SAM2.1-B+ image encoder weight distributions for $\lambda_0=2.0$. VRC achieves a significant (10-20%) reduction in standard deviation, averaged per transformer block. This compresses the dynamic range, critically lowering the initial quantization error for the subsequent QAT.
  • Figure 4: Qualitative results on W2A2 configuration for B+ encoder on promptable instance segmentation task.
  • Figure 5: Calibration error for the linear layer cut.blocks.21.mlp.layers.1 from the image encoder of the B+ model. In subfigure (a), we compare the output of the original linear layer with the output produced using the calibrated weights $\mathbf{\hat{W}}_\lambda$, measured via the $L_2$ norm, across different values of the hyperparameter $\lambda$. In subfigure (b), we report the quantization error of $\mathbf{\hat{W}}_\lambda$ for varying $\lambda$, relative to the quantization error of the original weights $\mathbf{W}$.
  • ...and 12 more figures
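
The captions for Figures 3 and 5 refer to calibrated weights $\mathbf{\hat{W}}_\lambda$ and a hyperparameter $\lambda$, but the VRC objective itself is not given in this section. One plausible reading, stated here purely as an assumption, is a ridge-style problem over calibration activations $\mathbf{X}$ (a hypothetical symbol for the inputs of the linear layer on the calibration batch):

$$
\mathbf{\hat{W}}_\lambda
= \arg\min_{\mathbf{\hat{W}}}
\left\| \mathbf{X}\mathbf{W} - \mathbf{X}\mathbf{\hat{W}} \right\|_F^2
+ \lambda \left\| \mathbf{\hat{W}} \right\|_F^2
\quad\Longrightarrow\quad
\mathbf{\hat{W}}_\lambda
= \left( \mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I} \right)^{-1}
\mathbf{X}^\top \mathbf{X}\, \mathbf{W}
$$

Under this assumed form, $\lambda = 0$ recovers $\mathbf{W}$ (when $\mathbf{X}^\top \mathbf{X}$ is invertible), while larger $\lambda$ scales each eigen-direction of $\mathbf{X}^\top \mathbf{X}$ with eigenvalue $\sigma_i^2$ by $\sigma_i^2 / (\sigma_i^2 + \lambda) \le 1$. That would shrink the weight spread at the cost of a larger output mismatch, which is the kind of trade-off the two subfigures of Figure 5 measure. The paper's actual formulation, including how $\lambda$ relates to the $\lambda_0$ of Figure 3, may differ.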