TinySAM: Pushing the Envelope for Efficient Segment Anything Model

Han Shu, Wenshuo Li, Yehui Tang, Yiman Zhang, Yihao Chen, Houqiang Li, Yunhe Wang, Xinghao Chen

TL;DR

This work addresses the high computational cost of the Segment Anything Model (SAM) by introducing TinySAM, a lightweight yet effective framework for zero-shot segmentation on constrained devices. It combines three components: a hard mining full-stage knowledge distillation strategy with online hard prompt sampling and hard mask weighting, post-training quantization adapted to the prompt-based segmentation task, and a hierarchical everything inference method that roughly halves the cost of everything-mode inference. Empirical results demonstrate substantial efficiency gains (including around 50% latency reduction for everything mode and up to two orders of magnitude in overall compute) while preserving strong zero-shot performance across COCO, LVIS, and other datasets, with competitive or superior AP against several baselines. The approach enables practical deployment of high-quality prompt-based segmentation on edge devices and broadens the applicability of zero-shot segmentation in real-world scenarios.

Abstract

Recently, the segment anything model (SAM) has shown powerful segmentation capability and has drawn great attention in the computer vision field. Many follow-up works have developed various applications based on the pre-trained SAM and achieved impressive performance on downstream vision tasks. However, SAM consists of heavy architectures and requires massive computational capacity, which hinders its further application on computation-constrained edge devices. To this end, in this paper we propose a framework to obtain a tiny segment anything model (TinySAM) while maintaining the strong zero-shot performance. We first propose a full-stage knowledge distillation method with hard prompt sampling and a hard mask weighting strategy to distill a lightweight student model. We also adapt post-training quantization to the prompt-based segmentation task and further reduce the computational cost. Moreover, a hierarchical segmenting everything strategy is proposed to accelerate the everything inference by $2\times$ with almost no performance degradation. With all these proposed methods, our TinySAM leads to orders of magnitude computational reduction and pushes the envelope for the efficient segment anything task. Extensive experiments on various zero-shot transfer tasks demonstrate the significantly advantageous performance of our TinySAM against counterpart methods. Code is available at https://github.com/xinghaochen/TinySAM and https://gitee.com/mindspore/models/tree/master/research/cv/TinySAM.

Paper Structure (18 sections, 11 equations, 7 figures, 6 tables)

Figures (7)

  • Figure 1: (a) The overall framework of our proposed method. By combining hard mining full-stage knowledge distillation, post-training quantization, and hierarchical everything inference, the computation cost is reduced by orders of magnitude. (b) The proposed TinySAM saves considerable computation cost while maintaining performance. Latency is measured with TensorRT on an NVIDIA T4 GPU.
  • Figure 2: The framework of the hard mining full-stage knowledge distillation. For the massive number of masks in the SA-1B dataset, we design hard prompt sampling for prompts and hard mask weighting for the distillation loss. In the sampling process, the stars represent points sampled at different iterations. As the iterations increase, the sampling region moves closer to the edge of the target mask, which makes the prompt relatively harder for the student network to learn. Moreover, a different weight is assigned to each mask when calculating the distillation loss, according to the gap between the student and teacher networks.
  • Figure 3: Comparison between our hierarchical strategy and the original strategy. (a) Point sampling (taking points_per_side=16 as an example) in the original everything mode. (b) Segmentation results of the original strategy. (c) First step of our hierarchical strategy, in which only $1/16$ of the points are sampled. (d) The high-confidence area is extracted from (c) and points inside it are ignored. The high-confidence area is shown as a white mask. (e) Segmentation results of our hierarchical strategy.
  • Figure 4: Results of the zero-shot point valid mask evaluation. The x-axis represents the number of prompt points and the y-axis represents the mIoU across all masks. The proposed TinySAM outperforms MobileSAM and achieves results close to SAM-B.
  • Figure 5: Visualization of the hierarchical everything strategy. (a) shows the intermediate result: high-confidence regions after the first round of sparse prompt points (white mask) and the remaining second-round dense prompt points (green stars). (b) shows the final segmentation result, in which small objects are accurately segmented.
  • ...and 2 more figures
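
The point-selection logic described in the Figure 3 caption can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`build_point_grid`, `split_sparse_dense`, `drop_covered_points`), the stride of 4 used to obtain 1/16 of the points, the cell-centered grid layout, and the boolean high-confidence mask are all hypothetical. The segmentation model itself is left out, so the high-confidence area is faked by hand.

```python
import numpy as np


def build_point_grid(points_per_side: int) -> np.ndarray:
    """Return an (n, n, 2) array of (x, y) prompt points in [0, 1]."""
    offset = 1.0 / (2 * points_per_side)
    coords = np.linspace(offset, 1.0 - offset, points_per_side)
    xs, ys = np.meshgrid(coords, coords)  # xs varies along columns, ys along rows
    return np.stack([xs, ys], axis=-1)


def split_sparse_dense(grid: np.ndarray, stride: int = 4):
    """Split the grid into a sparse first-pass subset and the remaining points.

    With stride=4 (an assumption), the sparse subset holds 1/16 of the points.
    """
    n = grid.shape[0]
    sparse_mask = np.zeros((n, n), dtype=bool)
    sparse_mask[stride // 2::stride, stride // 2::stride] = True
    return grid[sparse_mask], grid[~sparse_mask]


def drop_covered_points(points: np.ndarray, confident_area: np.ndarray) -> np.ndarray:
    """Discard points that fall inside a boolean (H, W) high-confidence mask."""
    h, w = confident_area.shape
    cols = np.minimum((points[:, 0] * w).astype(int), w - 1)
    rows = np.minimum((points[:, 1] * h).astype(int), h - 1)
    return points[~confident_area[rows, cols]]


# Step (a)/(c): full grid for points_per_side=16, then the sparse first pass.
grid = build_point_grid(16)
sparse, dense = split_sparse_dense(grid)

# Step (d): stand-in for the model's high-confidence area after the first pass.
# Here the left half of a 64x64 image is simply marked as confident.
confident = np.zeros((64, 64), dtype=bool)
confident[:, :32] = True

# Only the dense points outside the confident area are prompted in the second pass.
remaining = drop_covered_points(dense, confident)
print(len(sparse), len(dense), len(remaining))
```

In this toy setup the second pass prompts about half of the dense points, which is where the savings come from: every point that lands in an already confident region skips a prompted decoding step.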