Lite-SAM Is Actually What You Need for Segment Everything

Jianhai Fu, Yuanjie Yu, Ningchuan Li, Yi Zhang, Qichao Chen, Jianping Xiong, Jun Yin, Zhiyu Xiang

TL;DR

The paper tackles the high computational cost of SegEvery in Segment Anything (SAM). It introduces Lite-SAM, an end-to-end lightweight framework built from LiteViT and AutoPPN that replaces grid-search prompts with an automated, efficient prompt proposal mechanism. The authors show that Lite-SAM achieves state-of-the-art efficiency with around 4.2M total parameters and SegEvery times as low as about 80 ms, while maintaining competitive accuracy compared with larger SAM variants. The work demonstrates strong results on COCO and LVIS in zero-shot settings and even competitive edge detection on BSDS500, highlighting practical applicability for real-time segmentation on resource-constrained devices.

Abstract

This paper introduces Lite-SAM, an efficient end-to-end solution for the SegEvery task designed to reduce computational costs and redundancy. Lite-SAM is composed of four main components: a streamlined CNN-Transformer hybrid encoder (LiteViT), an automated prompt proposal network (AutoPPN), a traditional prompt encoder, and a mask decoder. All these components are integrated within the SAM framework. Our LiteViT, a high-performance lightweight backbone network, has only 1.16M parameters, which is a 23% reduction compared to the lightest existing backbone network, ShuffleNet. We also introduce AutoPPN, an innovative end-to-end method for generating prompt boxes and points. This is an improvement over traditional grid search sampling methods, and its design allows for easy integration into any SAM series algorithm, extending its usability. We have thoroughly benchmarked Lite-SAM across a range of both public and private datasets. The evaluation covered a broad set of common metrics, including the number of parameters, SegEvery execution time, and accuracy. The findings reveal that Lite-SAM, operating with a lean 4.2M parameters, significantly outpaces its counterparts, demonstrating performance improvements of 43x, 31x, 20x, 21x, and 1.6x over SAM, MobileSAM, Edge-SAM, EfficientViT-SAM, and MobileSAM-v2 respectively, while maintaining competitive accuracy. This underscores Lite-SAM's ability to balance performance and precision, thereby setting a new state-of-the-art (SOTA) benchmark in the domain.
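For context on the grid search sampling that AutoPPN is meant to replace, the following is a minimal sketch of a uniform point-prompt grid. It assumes a regular grid of normalized coordinates, similar in spirit to the 32-points-per-side default of SAM's automatic mask generator; the helper name `uniform_point_grid` is hypothetical and is not taken from the Lite-SAM or SAM code. The point is only that the number of prompts, and hence the number of mask-decoder queries, grows quadratically with grid density.

```python
def uniform_point_grid(points_per_side: int) -> list[tuple[float, float]]:
    """Return normalized (x, y) prompt points on a regular grid in [0, 1]^2.

    Hypothetical helper for illustration; each point is placed at the
    center of its grid cell.
    """
    offset = 1.0 / (2 * points_per_side)
    coords = [offset + i / points_per_side for i in range(points_per_side)]
    return [(x, y) for y in coords for x in coords]


# The prompt count is points_per_side squared, regardless of image content.
for n in (8, 16, 32):
    grid = uniform_point_grid(n)
    print(f"{n} points per side -> {len(grid)} prompts")
```

A learned proposal network can, in principle, emit far fewer prompts by placing them only where objects are likely to be, which is the motivation the abstract gives for AutoPPN.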

Paper Structure (27 sections, 1 equation, 7 figures, 10 tables)

This paper contains 27 sections, 1 equation, 7 figures, 10 tables.

Figures (7)

  • Figure 1: The proposed Lite-SAM achieves SOTA results in terms of backbone parameters (top left), full parameters (top right), multiply-accumulate operations (bottom left), and SegEvery time (bottom right). The metrics were evaluated in the zero-shot setting on the COCO dataset. Note that the comparison of backbone parameters is made against lightweight network structures (params $\leq$ 40M), with MAE not falling within this scope.
  • Figure 2: (a) Overview of the proposed Lite-SAM. The architecture consists of two detachable blocks, namely the Lightweight ViT backbone (LiteViT) and the Automated Prompt Proposal Network (AutoPPN). (b) Macro architecture of LiteViT. (c) Macro architecture of AutoPPN.
  • Figure 3: Overview of architectural choices. (1) represents the original PoolFormer Block, (2), (3), and (4) show the modifications to the PoolFormer Block, and (5) shows the final version of our Multiscale Pooling Self Attention module.
  • Figure 4: We compare two methods of generating pointwise foreground/background labels within an image (sa_3196.jpg) from SA-1B (Kirillov et al., 2023) (a). All the masks are visualized in (b). The pointwise labels generated by large, medium, and small masks are visualized in red, green, and blue, respectively. Compared with the bounding-box-center-with-Gaussian-kernel approach (c), the distance transform approach (d) provides a more satisfactory result with less ambiguity.
  • Figure 5: Qualitative results on "SegEvery". Models demonstrate mask generation capabilities. (1) Note that the EfficientViT-SAM (Cai et al.) result is based on the L1 model. (2) Lite-SAM employs an inference size of 640 $\times$ 640, while the other compared algorithms use a default size of 1024 $\times$ 1024.
  • ...and 2 more figures
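
The contrast drawn in the Figure 4 caption, between a bounding box center and a distance transform as the source of pointwise labels, can be illustrated with a toy example. The sketch below is not the authors' label-generation procedure: the U-shaped mask, the brute-force distance transform, and all function names are assumptions made for illustration. It shows one plausible source of ambiguity, namely that the bounding box center of a non-convex mask can land on background, while the distance-transform peak always lies inside the mask.

```python
import math

# Toy U-shaped mask (1 = foreground); purely illustrative, not from SA-1B.
MASK = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0, 0, 1, 1, 0],
    [0, 1, 1, 1, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
]


def bbox_center(mask):
    """Return the (row, col) center of the mask's tight bounding box."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return ((rows[0] + rows[-1]) // 2, (cols[0] + cols[-1]) // 2)


def distance_transform(mask):
    """Brute-force Euclidean distance from each foreground pixel to the
    nearest background pixel; background pixels get 0.0."""
    background = [(r, c) for r, row in enumerate(mask)
                  for c, v in enumerate(row) if v == 0]
    return [
        [min(math.hypot(r - br, c - bc) for br, bc in background) if v else 0.0
         for c, v in enumerate(row)]
        for r, row in enumerate(mask)
    ]


def argmax_2d(grid):
    """Return the (row, col) of the first maximum in row-major order."""
    cells = ((r, c) for r in range(len(grid)) for c in range(len(grid[0])))
    return max(cells, key=lambda rc: grid[rc[0]][rc[1]])


center = bbox_center(MASK)
dist = distance_transform(MASK)
peak = argmax_2d(dist)
print("bbox center:", center, "on foreground:", bool(MASK[center[0]][center[1]]))
print("distance-transform peak:", peak, "on foreground:", bool(MASK[peak[0]][peak[1]]))
```

For real masks, a library routine such as SciPy's `scipy.ndimage.distance_transform_edt` would replace the brute-force loop, which is only practical at toy sizes.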