EfficientViT-SAM: Accelerated Segment Anything Model Without Accuracy Loss

Zhuoyang Zhang, Han Cai, Song Han

TL;DR

EfficientViT-SAM is a fast open-world segmentation model that replaces SAM's image encoder with EfficientViT while keeping SAM's prompt encoder and mask decoder. Training proceeds in two stages: knowledge distillation from the SAM-ViT-H image encoder, followed by end-to-end training on SA-1B. The resulting models reach a $48.9\times$ measured TensorRT speedup over SAM-ViT-H on an A100 GPU with no drop in mAP. They show strong zero-shot performance on point- and box-prompted tasks and retain their advantage in in-the-wild and detector-assisted scenarios, using hardware-friendly computation throughout. The code and pre-trained models are open-sourced, enabling fast, accurate segmentation in time-sensitive or resource-constrained settings.

Abstract

We present EfficientViT-SAM, a new family of accelerated segment anything models. We retain SAM's lightweight prompt encoder and mask decoder while replacing the heavy image encoder with EfficientViT. For training, we begin with knowledge distillation from the SAM-ViT-H image encoder to EfficientViT. Subsequently, we conduct end-to-end training on the SA-1B dataset. Benefiting from EfficientViT's efficiency and capacity, EfficientViT-SAM delivers a 48.9x measured TensorRT speedup over SAM-ViT-H on an A100 GPU without sacrificing performance. Our code and pre-trained models are released at https://github.com/mit-han-lab/efficientvit.
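
The first training stage described above, distilling the SAM-ViT-H image encoder into EfficientViT, can be illustrated with a toy sketch. The abstract does not state which loss is used for distillation; the mean-squared-error objective, the one-weight "encoders", and all function names below are assumptions made for illustration only, not the authors' implementation. The second stage (end-to-end training on SA-1B) is not shown.

```python
from typing import List, Sequence

Vector = List[float]


def teacher_encoder(image: Vector) -> Vector:
    """Toy stand-in for a frozen teacher image encoder (hypothetical: scales by 2)."""
    return [2.0 * x for x in image]


def student_encoder(image: Vector, weight: float) -> Vector:
    """Toy stand-in for the student encoder with a single trainable weight."""
    return [weight * x for x in image]


def embedding_mse(student_emb: Sequence[float], teacher_emb: Sequence[float]) -> float:
    """Mean squared error between two embeddings (an assumed distillation loss)."""
    if len(student_emb) != len(teacher_emb):
        raise ValueError("embeddings must have the same length")
    return sum((s - t) ** 2 for s, t in zip(student_emb, teacher_emb)) / len(student_emb)


def distill(images: List[Vector], weight: float = 0.0, lr: float = 0.1, epochs: int = 200) -> float:
    """Fit the student weight to the teacher's embeddings with plain gradient descent."""
    for _ in range(epochs):
        for image in images:
            target = teacher_encoder(image)  # the teacher is never updated
            pred = student_encoder(image, weight)
            # d/dw of mean((w*x - t)^2) is mean(2 * (w*x - t) * x)
            grad = sum(2.0 * (p - t) * x for p, t, x in zip(pred, target, image)) / len(image)
            weight -= lr * grad
    return weight


# Two tiny "images" flattened to vectors (made-up data).
images = [[0.5, -1.0, 0.25], [1.0, 0.0, -0.5]]
trained_weight = distill(images)
final_loss = sum(
    embedding_mse(student_encoder(im, trained_weight), teacher_encoder(im)) for im in images
)
```

In this toy setting the student weight converges toward the teacher's scale factor, so the student reproduces the teacher's embeddings. In the real model the student is EfficientViT and the teacher is the SAM-ViT-H image encoder, with SAM's prompt encoder and mask decoder reused unchanged.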

Paper Structure (16 sections, 3 figures, 5 tables)

Figures (3)

  • Figure 1: Throughput vs. COCO Zero-Shot Instance Segmentation mAP. As far as we know, EfficientViT-SAM is the first accelerated SAM model that matches/outperforms the zero-shot performance of SAM-ViT-H (Kirillov et al., 2023), delivering the SOTA performance-efficiency trade-off.
  • Figure 2: Macro Architecture of EfficientViT-SAM-XL. 'ResBlock' refers to the basic building block from ResNet34 (He et al., 2016). 'F-MBConv' refers to the fused MBConv block from EfficientNetV2 (Tan and Le, 2021). 'EfficientViT Module' is the building block from EfficientViT (Cai et al., 2022).
  • Figure 3: Qualitative Segmentation Results of EfficientViT-SAM under Point, Box, and Everything Mode.