EfficientViT-SAM: Accelerated Segment Anything Model Without Accuracy Loss
Zhuoyang Zhang, Han Cai, Song Han
TL;DR
EfficientViT-SAM is a fast, accurate alternative to SAM for open-world segmentation: it replaces SAM's heavy image encoder with EfficientViT while keeping the lightweight prompt encoder and mask decoder. Training proceeds in two stages: knowledge distillation from the SAM-ViT-H image encoder, followed by end-to-end training on SA-1B. The resulting models reach a $48.9\times$ measured TensorRT throughput speedup over SAM-ViT-H on an A100 GPU with no drop in mAP. They show strong zero-shot performance on point- and box-prompted tasks, keep their advantage in in-the-wild and detector-assisted scenarios, and rely on hardware-friendly computation throughout. The code and pre-trained models are open-sourced, which makes fast, accurate segmentation practical in time-sensitive or resource-constrained settings.
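The encoder swap described above can be pictured with a minimal sketch. It is a toy illustration, not the released implementation: `SegmentationModel`, `heavy_encoder`, `light_encoder`, and the list-based "embeddings" are hypothetical stand-ins chosen only to show that the image encoder changes while the prompt encoder and mask decoder are reused as-is.

```python
from dataclasses import dataclass, replace
from typing import Callable, List

Vector = List[float]


@dataclass(frozen=True)
class SegmentationModel:
    """Hypothetical three-part model: image encoder, prompt encoder, mask decoder."""
    image_encoder: Callable[[Vector], Vector]
    prompt_encoder: Callable[[Vector], Vector]
    mask_decoder: Callable[[Vector, Vector], List[int]]

    def predict(self, image: Vector, prompt: Vector) -> List[int]:
        image_emb = self.image_encoder(image)
        prompt_emb = self.prompt_encoder(prompt)
        return self.mask_decoder(image_emb, prompt_emb)


def heavy_encoder(image: Vector) -> Vector:
    """Stand-in for the large original image encoder (toy computation)."""
    return [2.0 * pixel for pixel in image]


def light_encoder(image: Vector) -> Vector:
    """Stand-in for the efficient replacement; it should emit compatible embeddings."""
    return [2.0 * pixel for pixel in image]


def prompt_encoder(prompt: Vector) -> Vector:
    """Toy prompt encoder: collapses the prompt into a single threshold value."""
    return [sum(prompt)]


def mask_decoder(image_emb: Vector, prompt_emb: Vector) -> List[int]:
    """Toy decoder: marks positions whose embedding exceeds the prompt threshold."""
    threshold = prompt_emb[0]
    return [1 if value > threshold else 0 for value in image_emb]


original = SegmentationModel(heavy_encoder, prompt_encoder, mask_decoder)
# Only the image encoder is replaced; the other two components are shared.
accelerated = replace(original, image_encoder=light_encoder)
```

Because the decoder consumes whatever the image encoder produces, the swap only works if the new encoder emits embeddings the existing decoder can interpret, which is what the distillation stage is meant to ensure.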
Abstract
We present EfficientViT-SAM, a new family of accelerated segment anything models. We retain SAM's lightweight prompt encoder and mask decoder while replacing the heavy image encoder with EfficientViT. For training, we begin with knowledge distillation from the SAM-ViT-H image encoder to EfficientViT, and then conduct end-to-end training on the SA-1B dataset. Benefiting from EfficientViT's efficiency and capacity, EfficientViT-SAM delivers a 48.9x measured TensorRT speedup on an A100 GPU over SAM-ViT-H without sacrificing performance. Our code and pre-trained models are released at https://github.com/mit-han-lab/efficientvit.
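The first training stage, distilling the teacher image encoder into the student, can be sketched in miniature. The sketch rests on stated assumptions: the abstract does not name the distillation loss, so a mean-squared error between teacher and student embeddings is assumed here; tiny linear maps stand in for both encoders; and every name (`distill_step`, `run_distillation`, `weight_gap`) is hypothetical. The second, end-to-end stage is not shown.

```python
import random
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def linear(weights: Matrix, x: Vector) -> Vector:
    """Apply a dense layer (one row per output unit) to the input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two embeddings of equal length."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def distill_step(student_w: Matrix, teacher_w: Matrix, x: Vector, lr: float) -> float:
    """One SGD step moving the student embedding toward the frozen teacher's."""
    target = linear(teacher_w, x)   # teacher weights are never updated
    pred = linear(student_w, x)
    n = len(pred)
    for i, row in enumerate(student_w):
        grad_out = 2.0 * (pred[i] - target[i]) / n   # d(MSE)/d(pred_i)
        for j in range(len(row)):
            row[j] -= lr * grad_out * x[j]
    return mse(pred, target)        # loss measured before the update


def weight_gap(student_w: Matrix, teacher_w: Matrix) -> float:
    """Sum of squared differences between student and teacher weights."""
    return sum((s - t) ** 2
               for s_row, t_row in zip(student_w, teacher_w)
               for s, t in zip(s_row, t_row))


def run_distillation(steps: int = 200, dim_in: int = 4, dim_out: int = 3,
                     lr: float = 0.5, seed: int = 0) -> Tuple[List[float], Matrix, Matrix]:
    rng = random.Random(seed)
    teacher_w = [[rng.uniform(-1, 1) for _ in range(dim_in)] for _ in range(dim_out)]
    student_w = [[0.0] * dim_in for _ in range(dim_out)]
    losses = []
    for _ in range(steps):
        x = [rng.uniform(-1, 1) for _ in range(dim_in)]   # stand-in for an image
        losses.append(distill_step(student_w, teacher_w, x, lr))
    return losses, student_w, teacher_w
```

Calling `run_distillation()` returns the per-step losses along with both weight matrices, so `weight_gap` can be compared before and after training to confirm the student has moved toward the teacher.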
