EdgeTAM: On-Device Track Anything Model

Chong Zhou; Chenchen Zhu; Yunyang Xiong; Saksham Suri; Fanyi Xiao; Lemeng Wu; Raghuraman Krishnamoorthi; Bo Dai; Chen Change Loy; Vikas Chandra; Bilge Soran

TL;DR

EdgeTAM addresses the on-device latency bottleneck of SAM 2 by replacing dense memory attention with a lightweight 2D Spatial Perceiver and by using distillation to boost accuracy. The core idea is to compress dense frame-level memories with a global and a 2D latent attention mechanism, reducing memory-attention complexity from $O(TCH^2W^2)$ to $O(TCHW(N_g+N_l))$ while preserving spatial structure. A two-stage distillation pipeline aligns the student with the SAM 2 teacher in both image and video settings, improving $\mathcal{J}\&\mathcal{F}$ on SA-V by about 1.3–3.3 points without inference overhead. Empirically, EdgeTAM achieves competitive results on PVS, SA, and VOS benchmarks and runs at 16 FPS on a mobile GPU, enabling on-device unified segmentation and tracking for multimedia applications.
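To make the complexity claim concrete, the following sketch evaluates the two cost expressions from the paragraph above and reports their ratio, which simplifies to $HW/(N_g+N_l)$. The function names and the sizes plugged in (7 memory frames, 64 channels, a 64x64 feature map, 256 latents per group) are illustrative assumptions, not values stated in this summary.

```python
def dense_cost(t, c, h, w):
    """Leading term of dense memory attention: O(T * C * H^2 * W^2)."""
    return t * c * (h * w) ** 2


def perceiver_cost(t, c, h, w, n_global, n_latent_2d):
    """Leading term after compression: O(T * C * H * W * (N_g + N_l))."""
    return t * c * h * w * (n_global + n_latent_2d)


# Hypothetical sizes, chosen only to illustrate the arithmetic.
T, C, H, W = 7, 64, 64, 64
N_G, N_L = 256, 256

speedup = dense_cost(T, C, H, W) / perceiver_cost(T, C, H, W, N_G, N_L)
print(speedup)  # 8.0, i.e. H * W / (N_G + N_L) = 4096 / 512
```

Because $T$ and $C$ cancel, the saving depends only on how many latents replace the $HW$ memory tokens of each frame.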

Abstract

Building on the Segment Anything Model (SAM), SAM 2 extends its capability from image to video inputs through a memory bank mechanism and achieves remarkable performance compared with previous methods, making it a foundation model for video segmentation tasks. In this paper, we aim to make SAM 2 much more efficient so that it can run even on mobile devices while maintaining comparable performance. Although several works optimize SAM for better efficiency, we find they are not sufficient for SAM 2 because they all focus on compressing the image encoder, while our benchmark shows that the newly introduced memory attention blocks are also a latency bottleneck. Given this observation, we propose EdgeTAM, which leverages a novel 2D Spatial Perceiver to reduce the computational cost. In particular, the proposed 2D Spatial Perceiver encodes the densely stored frame-level memories with a lightweight Transformer that contains a fixed set of learnable queries. Because video segmentation is a dense prediction task, we find that preserving the spatial structure of the memories is essential, so the queries are split into global-level and patch-level groups. We also propose a distillation pipeline that further improves performance without inference overhead. As a result, EdgeTAM achieves 87.7, 70.0, 72.3, and 71.7 J&F on DAVIS 2017, MOSE, SA-V val, and SA-V test, while running at 16 FPS on iPhone 15 Pro Max.
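The split into global-level and patch-level queries can be illustrated with a small, dependency-free sketch. It assumes one plausible reading of the abstract: global queries attend to every memory token, while each patch-level query attends only to its own non-overlapping window of the memory grid. The names `attend` and `compress_memory`, the single-head attention without learned projections, and the toy sizes are all hypothetical simplifications, not the authors' implementation.

```python
import math
import random


def attend(queries, tokens):
    """Single-head dot-product attention; tokens serve as both keys and values."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([sum(wt * tok[d] for wt, tok in zip(weights, tokens)) / total
                        for d in range(dim)])
    return outputs


def compress_memory(memory, global_queries, patch_queries, grid):
    """memory: H x W grid of C-dim vectors; patch_queries: grid x grid queries."""
    height, width = len(memory), len(memory[0])
    assert height % grid == 0 and width % grid == 0
    win_h, win_w = height // grid, width // grid
    # Global group: every query sees all H * W memory tokens.
    flat = [vec for row in memory for vec in row]
    compressed = attend(global_queries, flat)
    # Patch-level group: query (i, j) sees only its window, keeping the 2D layout.
    for i in range(grid):
        for j in range(grid):
            window = [memory[y][x]
                      for y in range(i * win_h, (i + 1) * win_h)
                      for x in range(j * win_w, (j + 1) * win_w)]
            compressed.extend(attend([patch_queries[i][j]], window))
    return compressed


random.seed(0)
C, H, W, N_G, GRID = 4, 8, 8, 3, 2  # toy sizes, assumptions for illustration


def rand_vec():
    return [random.gauss(0, 1) for _ in range(C)]


memory = [[rand_vec() for _ in range(W)] for _ in range(H)]
global_q = [rand_vec() for _ in range(N_G)]
patch_q = [[rand_vec() for _ in range(GRID)] for _ in range(GRID)]
compressed = compress_memory(memory, global_q, patch_q, GRID)
print(len(compressed), len(compressed[0]))  # 7 4
```

In this toy setting, 64 memory tokens are reduced to 7 (3 global plus a 2x2 grid of patch-level outputs), and the patch-level outputs stay in row-major window order, which is the sense in which spatial structure is preserved.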

Paper Structure (18 sections, 8 equations, 6 figures, 5 tables)

Introduction
Related Work
Methodology
Preliminary: SAM 2
EdgeTAM
Distillation Pipeline
Experiments
Implementation Details
Datasets
Promptable Video Segmentation (PVS)
Segment Anything (SA)
Video Object Segmentation (VOS)
Ablations
Qualitative Results
Conclusion
...and 3 more sections

Figures (6)

Figure 1: Speed-performance trade-offs on iPhone 15 Pro Max and NVIDIA A100. EdgeTAM is significantly faster than SAM 2 on edge devices, and compared to other VOS methods it is also more accurate on the challenging SA-V val dataset. Note that EdgeTAM runs at 16 FPS on iPhone 15 Pro Max.
Figure 2: Single-frame latency (ms) on iPhone. In (a), we show that replacing only the image encoder with more compact backbones is not enough for further speed-up, since the decoder is also a bottleneck. In (b), by reducing the number of memory attention blocks and removing certain modules, we find that cross attention (CA) is the root cause.
Figure 3: Overall architecture of EdgeTAM. The meta-architecture of EdgeTAM follows SAM 2; the main difference is the proposed plug-in module, the 2D Spatial Perceiver, marked with an orange dotted box.
Figure 4: The distillation pipeline in EdgeTAM. In the image pre-training stage, we align the features from the teacher's and student's image encoders. In the video training stage, we additionally align the features output by memory attention between teacher and student. Task-specific losses are used in both stages.
Figure 5: Zero-shot PVS accuracy across 9 datasets in offline and online settings.
...and 1 more figure
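The two-stage feature alignment described in the Figure 4 caption can be sketched as a task loss plus weighted alignment terms. This is a minimal illustration under stated assumptions: mean squared error is used as the alignment measure, and the function names, feature keys, and weights are hypothetical; the caption only says that features are aligned and that task-specific losses are used in both stages.

```python
def mse(teacher, student):
    """Mean squared error between two equally sized flat feature lists."""
    assert len(teacher) == len(student)
    return sum((t - s) ** 2 for t, s in zip(teacher, student)) / len(teacher)


def distillation_loss(task_loss, teacher_feats, student_feats, weights):
    """Task loss plus one weighted alignment term per named feature."""
    total = task_loss
    for name, weight in weights.items():
        total += weight * mse(teacher_feats[name], student_feats[name])
    return total


# Toy features; real ones would be tensors from the teacher and student models.
teacher = {"image_encoder": [1.0, 2.0], "memory_attention": [0.0, 0.0]}
student = {"image_encoder": [1.0, 4.0], "memory_attention": [1.0, 1.0]}

# Stage 1 (image pre-training): align image-encoder features only.
stage1 = distillation_loss(0.5, teacher, student, {"image_encoder": 1.0})

# Stage 2 (video training): additionally align memory-attention outputs.
stage2 = distillation_loss(
    0.5, teacher, student, {"image_encoder": 1.0, "memory_attention": 0.5}
)
print(stage1, stage2)  # 2.5 3.0
```

Since the alignment terms only appear in the training objective, the student runs unchanged at inference time, which is consistent with the "without inference overhead" claim.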
