Efficient Track Anything

Yunyang Xiong, Chong Zhou, Xiaoyu Xiang, Lemeng Wu, Chenchen Zhu, Zechun Liu, Saksham Suri, Balakrishnan Varadarajan, Ramya Akula, Forrest Iandola, Raghuraman Krishnamoorthi, Bilge Soran, Vikas Chandra

TL;DR

EfficientTAM addresses the high computational cost of SAM 2 for video object segmentation and tracking by replacing the heavy hierarchical image encoder with a vanilla lightweight ViT and introducing an efficient memory cross-attention that uses coarse spatial token pooling. Trained on SA-1B and SA-V, EfficientTAM performs comparably to SAM 2 while offering roughly $2\times$ speedup on an A100 GPU and about $2.4\times$ fewer parameters, and it runs on-device at around $10$ FPS on an iPhone 15 Pro Max. Ablation studies show the importance of object-pointer tokens and of the coarse memory-token surrogate for cross-attention, establishing a strong accuracy-efficiency tradeoff across both video and image segmentation tasks. This work enables practical, on-device video segmentation and tracking, expanding the deployability of segmentation foundation models in real-world applications.

Abstract

Segment Anything Model 2 (SAM 2) has emerged as a powerful tool for video object segmentation and tracking anything. Key components of SAM 2 that drive its impressive video object segmentation performance include a large multistage image encoder for frame feature extraction and a memory mechanism that stores memory contexts from past frames to help current frame segmentation. The high computational complexity of the multistage image encoder and memory module has limited its applications in real-world tasks, e.g., video object segmentation on mobile devices. To address this limitation, we propose EfficientTAMs, lightweight track anything models that produce high-quality results with low latency and small model size. Our idea is based on revisiting the plain, nonhierarchical Vision Transformer (ViT) as an image encoder for video object segmentation, and introducing an efficient memory module, which together reduce the complexity of both frame feature extraction and memory computation for current frame segmentation. We take vanilla lightweight ViTs and the efficient memory module to build EfficientTAMs, and train the models on the SA-1B and SA-V datasets for video object segmentation and track anything tasks. We evaluate on multiple video segmentation benchmarks, including semi-supervised VOS and promptable video segmentation, and find that our proposed EfficientTAM with a vanilla ViT performs comparably to the SAM 2 model (HieraB+SAM 2) with ~2x speedup on A100 and ~2.4x parameter reduction. On segment anything image tasks, our EfficientTAMs also perform favorably compared with the original SAM, with ~20x speedup on A100 and ~20x parameter reduction. On mobile devices such as iPhone 15 Pro Max, our EfficientTAMs can run at ~10 FPS for video object segmentation with reasonable quality, highlighting the capability of small models for on-device video object segmentation applications.

Paper Structure

This paper contains 14 sections, 1 theorem, 13 equations, 8 figures, 8 tables.

Key Result

Lemma 1

For the coarse memory tokens $\bar{K}$ and $\bar{V}$ and queries $Q\in\mathbb{R}^{L\times d}$, we have $\operatorname{softmax}\left(\frac{Q\bar{K}^{T}}{\sqrt{d}}\right)\bar{V} = \operatorname{softmax}(A)\,\tilde{V}$, where $A = \left[\frac{Q\tilde{K}_s^{T}}{\sqrt{d}} + \ln{(l_w\times l_h)}, \frac{QK_p^{T}}{\sqrt{d}}\right] \in \mathbb{R}^{L\times (\tilde{w}\tilde{h} + P)}$ and $\tilde{V} = [\tilde{V}_s; V_p] \in \mathbb{R}^{(\tilde{w}\tilde{h}+P)\times d}$.
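
The role of the $\ln(l_w\times l_h)$ term can be checked numerically. The sketch below assumes that $\bar{K}$ and $\bar{V}$ are built by repeating each pooled spatial token $l_w\times l_h$ times and appending the $P$ object-pointer tokens, which is the reading consistent with the shapes of $A$ and $\tilde{V}$ above. Under that assumption, attention over the repeated tokens should equal $\operatorname{softmax}(A)\,\tilde{V}$, because repeating a key $l_w l_h$ times multiplies its unnormalized softmax weight by $l_w l_h$, the same as adding $\ln(l_w l_h)$ to its logit. All variable names and sizes are illustrative and not taken from the authors' code.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax with max subtraction for numerical stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
L, d, P = 5, 8, 3      # queries, channels, object-pointer tokens (toy sizes)
n_coarse = 4           # number of pooled spatial tokens (w~ * h~)
win = 6                # window size l_w * l_h, e.g. 2 x 3

Q = rng.normal(size=(L, d))
K_s_tilde = rng.normal(size=(n_coarse, d))   # pooled spatial keys
V_s_tilde = rng.normal(size=(n_coarse, d))   # pooled spatial values
K_p = rng.normal(size=(P, d))                # object-pointer keys
V_p = rng.normal(size=(P, d))                # object-pointer values

# Left-hand side: assumed construction of K-bar / V-bar, with every
# pooled token repeated `win` times before the pointer tokens.
K_bar = np.concatenate([np.repeat(K_s_tilde, win, axis=0), K_p])
V_bar = np.concatenate([np.repeat(V_s_tilde, win, axis=0), V_p])
lhs = softmax(Q @ K_bar.T / np.sqrt(d)) @ V_bar

# Right-hand side: short key/value sequence with a log-window bias
# on the spatial logits only.
A = np.concatenate(
    [Q @ K_s_tilde.T / np.sqrt(d) + np.log(win), Q @ K_p.T / np.sqrt(d)],
    axis=1,
)
V_tilde = np.concatenate([V_s_tilde, V_p])
rhs = softmax(A) @ V_tilde

max_abs_diff = float(np.abs(lhs - rhs).max())
print(max_abs_diff)
```

The right-hand side attends over $\tilde{w}\tilde{h} + P$ tokens instead of $\tilde{w}\tilde{h}\, l_w l_h + P$, which is where the saving in memory cross-attention comes from.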

Figures (8)

  • Figure 1: Comparative analysis. (Left) Speed comparison between EfficientTAM and SAM 2 on a single NVIDIA A100 GPU. While SAM 2 is challenging for on-device deployment, our EfficientTAM runs at 261 ms per frame on iPhone 15 Pro Max. (Right) FPS/Parameter/Performance comparison of EfficientTAM, SAM 2, and other efficient models for zero-shot video object segmentation on SA-V test. We benchmark FPS (frames per second) of all models with 1024 × 1024 input resolution on a single NVIDIA A100.
  • Figure 2: EfficientTAM architecture. Our proposed EfficientTAM takes a vanilla lightweight ViT image encoder for frame feature extraction. An efficient memory cross-attention is proposed to further improve the efficiency of EfficientTAM by leveraging the strong locality of memory spatial embeddings. EfficientTAM is fully trained on SA-1B (image) and SA-V (video) for unified image and video segmentation.
  • Figure 3: An example showing the strong locality of the Keys and Values in the cross-attention of the memory module. Keys and Values are matrices of size $28700\times 256$. Cross-attention is a matrix of size $4096\times 256$. For simplicity of visualization and comparison, we only draw the top matrix of size $320\times256$. To obtain coarse Keys and Values, we use a single averaged token to represent the other tokens in each homogeneous window of size $2\times 2$. At right, we visualize the difference between the original cross-attention (eq:crossattn) and the efficient cross-attention (eq:ecrossattn); the relative error w.r.t. the original cross-attention is $0.03$ under the Frobenius norm.
  • Figure 4: Promptable video segmentation results across 9 video segmentation datasets under interactive offline (left) and online (right) evaluation settings. The average $\mathcal{J}$&$\mathcal{F}$ over $1, \dots, 8$ interacted frames is reported.
  • Figure 5: Visualization results on video segmentation and tracking with SAM 2 and our EfficientTAM model. We sample a subset of frames for visualization. The segmented objects, e.g., the goose and the camel, are colored in red.
  • ...and 3 more figures
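
The comparison described in the Figure 3 caption (average each $2\times 2$ window of Keys and Values into one token, then measure the relative Frobenius error against full cross-attention) can be imitated on toy data. The following is a minimal sketch under stated assumptions, not the authors' implementation: the grid size, channel count, and synthetic "locally homogeneous" tokens are all invented for illustration, object-pointer tokens are left out, and the helper names `avg_pool_tokens` and `relative_error` are hypothetical. The error it prints depends on the synthetic data and is not meant to reproduce the $0.03$ reported in the caption.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax with max subtraction for numerical stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def avg_pool_tokens(X, h, w, lh, lw):
    """Average-pool an (h*w, d) row-major token grid over lh x lw windows."""
    d = X.shape[1]
    grid = X.reshape(h // lh, lh, w // lw, lw, d)
    return grid.mean(axis=(1, 3)).reshape(-1, d)

def relative_error(noise, h=8, w=8, lh=2, lw=2, d=16, L=10, seed=0):
    """Relative Frobenius error of coarse vs. full cross-attention."""
    rng = np.random.default_rng(seed)
    Q = rng.normal(size=(L, d))

    def tokens():
        # Toy locality: tokens are constant inside each window, plus noise.
        base = rng.normal(size=(h // lh, w // lw, d))
        up = np.repeat(np.repeat(base, lh, axis=0), lw, axis=1)
        return up.reshape(-1, d) + noise * rng.normal(size=(h * w, d))

    K, V = tokens(), tokens()
    full = softmax(Q @ K.T / np.sqrt(d)) @ V

    K_c = avg_pool_tokens(K, h, w, lh, lw)
    V_c = avg_pool_tokens(V, h, w, lh, lw)
    # Each coarse token stands in for lh*lw originals, hence the log bias.
    coarse = softmax(Q @ K_c.T / np.sqrt(d) + np.log(lh * lw)) @ V_c

    # np.linalg.norm of a 2-D array defaults to the Frobenius norm.
    return float(np.linalg.norm(full - coarse) / np.linalg.norm(full))

err_exact = relative_error(noise=0.0)    # perfectly homogeneous windows
err_noisy = relative_error(noise=0.05)   # mildly inhomogeneous windows
print(err_exact, err_noisy)
```

When the windows are perfectly homogeneous the coarse attention matches the full one up to floating-point error; the approximation error grows as tokens within a window diverge, which is why the locality shown in Figure 3 matters.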

Theorems & Definitions (2)

  • Lemma 1
  • proof