EfficientViT-SAM: Accelerated Segment Anything Model Without Accuracy Loss

Zhuoyang Zhang; Han Cai; Song Han

TL;DR

EfficientViT-SAM is a fast, accurate alternative to SAM for open-world segmentation: it replaces SAM's image encoder with EfficientViT while keeping the prompt encoder and mask decoder. Training has two stages: knowledge distillation from the SAM-ViT-H image encoder, followed by end-to-end training on SA-1B. The resulting models reach a $48.9\times$ measured TensorRT speedup over SAM-ViT-H on an A100 GPU with no drop in zero-shot mAP. They show strong zero-shot performance on point- and box-prompted tasks and retain their advantage in in-the-wild and detector-assisted settings while remaining hardware-friendly. The code and pre-trained models are open source, which enables deployment of fast, accurate segmentation in time-sensitive or resource-constrained settings.

Abstract

We present EfficientViT-SAM, a new family of accelerated segment anything models. We retain SAM's lightweight prompt encoder and mask decoder while replacing the heavy image encoder with EfficientViT. For training, we first distill knowledge from the SAM-ViT-H image encoder into EfficientViT and then train end-to-end on the SA-1B dataset. Benefiting from EfficientViT's efficiency and capacity, EfficientViT-SAM delivers a 48.9x measured TensorRT speedup over SAM-ViT-H on an A100 GPU without sacrificing performance. Our code and pre-trained models are released at https://github.com/mit-han-lab/efficientvit.
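
The encoder swap and the first training stage described in the abstract can be illustrated with a toy sketch. This is not the released implementation: the class `SamLikeModel`, the stand-in encoder functions, and the choice of a mean-squared-error loss between image embeddings are all assumptions made for illustration, and real encoders are neural networks rather than list transforms.

```python
from dataclasses import dataclass, replace
from typing import Callable, List

Vector = List[float]


@dataclass(frozen=True)
class SamLikeModel:
    """Hypothetical three-part promptable segmenter (names are illustrative)."""
    image_encoder: Callable[[Vector], Vector]
    prompt_encoder: Callable[[Vector], Vector]
    mask_decoder: Callable[[Vector, Vector], Vector]

    def predict(self, image: Vector, prompt: Vector) -> Vector:
        # Encode image and prompt separately, then decode a mask from both.
        return self.mask_decoder(self.image_encoder(image), self.prompt_encoder(prompt))


def embedding_mse(student: Vector, teacher: Vector) -> float:
    """Distillation loss between image embeddings (MSE is an assumption here)."""
    if len(student) != len(teacher):
        raise ValueError("embeddings must have the same length")
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(teacher)


# Toy stand-ins for the heavy teacher encoder and the lighter student encoder.
def teacher_encoder(image: Vector) -> Vector:
    return [2.0 * x for x in image]


def make_student_encoder(scale: float) -> Callable[[Vector], Vector]:
    return lambda image: [scale * x for x in image]


def prompt_encoder(prompt: Vector) -> Vector:
    return list(prompt)


def mask_decoder(image_emb: Vector, prompt_emb: Vector) -> Vector:
    bias = sum(prompt_emb)
    return [1.0 if e + bias > 0 else 0.0 for e in image_emb]


teacher_model = SamLikeModel(teacher_encoder, prompt_encoder, mask_decoder)
# Swap only the image encoder; the prompt encoder and mask decoder are reused.
student_model = replace(teacher_model, image_encoder=make_student_encoder(1.5))

image = [1.0, -2.0, 3.0]
loss = embedding_mse(
    student_model.image_encoder(image),
    teacher_model.image_encoder(image),
)
```

In a real stage-one setup, a loss of this kind would be minimized over the student encoder's parameters so that its embeddings approach the teacher's. The second stage, end-to-end training on SA-1B, is not sketched here.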

Paper Structure (16 sections, 3 figures, 5 tables)

Introduction
Related Work
Segment Anything Model
Efficient Deep Learning Computing
Method
EfficientViT
EfficientViT-SAM
Model Architecture
Training
Experiment
Runtime Efficiency
Zero-Shot Point-Prompted Segmentation
Zero-Shot Box-Prompted Segmentation
Zero-Shot In-the-Wild Segmentation
Qualitative Results
...and 1 more section

Figures (3)

Figure 1: Throughput vs. COCO Zero-Shot Instance Segmentation mAP. As far as we know, EfficientViT-SAM is the first accelerated SAM model that matches or outperforms the zero-shot performance of SAM-ViT-H [kirillov2023segment], delivering the SOTA performance-efficiency trade-off.
Figure 2: Macro Architecture of EfficientViT-SAM-XL. 'ResBlock' refers to the basic building block from ResNet34 [he2016deep]. 'F-MBConv' refers to the fused MBConv block from EfficientNetV2 [tan2021efficientnetv2]. 'EfficientViT Module' is the building block from EfficientViT [cai2022efficientvit].
Figure 3: Qualitative Segmentation Results of EfficientViT-SAM under Point, Box, and Everything Mode.
