EfficientSAM3: Progressive Hierarchical Distillation for Video Concept Segmentation from SAM1, 2, and 3
Chengxi Zeng, Yuxuan Jiang, Aaron Zhang
TL;DR
This work addresses the prohibitive on-device cost of SAM3 by introducing EfficientSAM3-PHD, a three-stage progressive hierarchical distillation framework that transfers SAM3's Promptable Concept Segmentation (PCS) capabilities to lightweight student backbones. Stage 1 distills the image encoder using prompt-in-the-loop supervision on SA-1B; Stage 2 replaces the dense memory with a compact 2D Spatial Perceiver for video tracking; and Stage 3 fine-tunes the full pipeline end to end on SA-Co to preserve PCS behavior. The approach yields a family of nine models across RepViT, TinyViT, and EfficientViT backbones, offering a range of accuracy-latency trade-offs while maintaining fidelity to teacher behavior. With compressed temporal memory and end-to-end refinement, EfficientSAM3-PHD makes practical, real-time PCS and tracking feasible in on-device scenarios such as AR, robotics, and mobile tools.
Abstract
The Segment Anything Model 3 (SAM3) advances visual understanding with Promptable Concept Segmentation (PCS) across images and videos, but its unified architecture (shared vision backbone, DETR-style detector, dense-memory tracker) remains prohibitive for on-device use. We present EfficientSAM3, a family of efficient models built on Progressive Hierarchical Distillation (PHD) that transfers capability from SAM3 to lightweight students in three stages: (1) Encoder Distillation aligns image features via prompt-in-the-loop training on SA-1B; (2) Temporal Memory Distillation replaces dense memory with a compact Perceiver-based module trained on SA-V to compress and retrieve spatiotemporal features efficiently; and (3) End-to-End Fine-Tuning refines the full pipeline on the official SAM3 PCS data to preserve concept-level performance. PHD yields a spectrum of student variants using RepViT, TinyViT, and EfficientViT backbones, enabling on-device concept segmentation and tracking while maintaining high fidelity to teacher behavior. We benchmark on popular VOS datasets and compare against a range of related work, achieving strong performance-efficiency trade-offs.
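To make the memory-compression idea in stage (2) concrete, the following is a minimal sketch of Perceiver-style cross-attention, in which a small set of latent queries attends over a much larger set of memory tokens. It is an illustration under stated assumptions, not the EfficientSAM3 implementation: the function names, the token counts, the random inputs, and the omission of learned query/key/value projections are all hypothetical simplifications, and it is written in plain Python only so that it runs without dependencies.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compress_memory(latents, memory):
    """Cross-attend K latent queries over N memory tokens.

    latents: K vectors of length d (stand-ins for learned queries; assumption)
    memory:  N vectors of length d (stand-ins for flattened memory features)
    Returns K vectors of length d, so later attention cost scales with K, not N.
    Learned projections are omitted (identity) to keep the sketch short.
    """
    d = len(latents[0])
    scale = 1.0 / math.sqrt(d)
    compressed = []
    for q in latents:
        # Scaled dot-product score of this latent against every memory token.
        scores = [scale * sum(qi * mi for qi, mi in zip(q, m)) for m in memory]
        weights = softmax(scores)
        # Attention-weighted average of the memory tokens.
        compressed.append(
            [sum(w * m[j] for w, m in zip(weights, memory)) for j in range(d)]
        )
    return compressed


random.seed(0)
d, n_memory, n_latents = 8, 64, 4  # illustrative sizes, not from the paper
memory = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_memory)]
latents = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_latents)]
compressed = compress_memory(latents, memory)
print(len(memory), "->", len(compressed), "tokens")
```

In this sketch the tracker would read from the 4 compressed tokens instead of the 64 original ones; in a trained module the latents and projections would be learned, here presumably under supervision from the teacher's dense-memory outputs.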
