Efficient-SAM2: Accelerating SAM2 with Object-Aware Visual Encoding and Memory Retrieval

Jing Zhang, Zhikai Li, Xuewen Liu, Qingyi Gu

TL;DR

This paper tackles the computational bottlenecks of SAM2 in video object segmentation by revealing a sparse perception pattern: the mask decoder focuses on foreground while the image encoder processes broader regions, and memory attention favors a small, temporally consistent set of tokens. To exploit this, the authors propose Efficient-SAM2, a post-training acceleration framework consisting of object-aware Sparse Window Routing (SWR) for the image encoder and Sparse Memory Retrieval (SMR) for the memory attention. SWR routes background windows through a lightweight shortcut branch, guided by spatial-temporal cues from previous frames, while SMR caches and reuses memory saliency patterns to prune token-level computations. Across SAM2.1-B+/L models and multiple VOS benchmarks, Efficient-SAM2 delivers up to 1.68× end-to-end speedup with minimal accuracy loss, and SWR/SMR achieve 1.83× and 1.78× speedups respectively, showing strong practical potential for real-time video understanding without costly retraining.
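The window-level routing idea behind SWR can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the previous frame's predicted mask is available as a binary grid, that windows are non-overlapping squares, and that overlap with that mask is the only routing cue. The names `route_windows` and `min_fg_fraction` are hypothetical and do not come from the paper.

```python
from typing import List, Tuple

Window = Tuple[int, int]  # (window_row, window_col) in the window grid


def route_windows(
    prev_mask: List[List[int]],
    window: int,
    min_fg_fraction: float = 0.0,  # hypothetical threshold, not the paper's tau
) -> Tuple[List[Window], List[Window]]:
    """Split non-overlapping windows into full-branch and shortcut-branch sets.

    A window goes to the full branch when the fraction of foreground pixels
    in the previous-frame mask exceeds `min_fg_fraction`; otherwise it is
    treated as background and sent to the lightweight shortcut.
    """
    height, width = len(prev_mask), len(prev_mask[0])
    full: List[Window] = []
    shortcut: List[Window] = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            cells = [
                value
                for row in prev_mask[top:top + window]
                for value in row[left:left + window]
            ]
            fg_fraction = sum(cells) / len(cells)
            target = full if fg_fraction > min_fg_fraction else shortcut
            target.append((top // window, left // window))
    return full, shortcut


# Toy 4x4 mask with a single foreground pixel in the top-left window.
toy_mask = [
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
full_windows, shortcut_windows = route_windows(toy_mask, window=2)
window_sparsity = len(shortcut_windows) / (len(full_windows) + len(shortcut_windows))
```

On this toy mask only the top-left window reaches the full branch, so three of the four windows take the shortcut (a window sparsity of 0.75). A real system would also need to handle objects that move between frames, which this sketch ignores.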

Abstract

Segment Anything Model 2 (SAM2) shows excellent performance in video object segmentation tasks; however, its heavy computational burden hinders its application in real-time video processing. Although there have been efforts to improve the efficiency of SAM2, most of them focus on retraining a lightweight backbone, with little exploration of post-training acceleration. In this paper, we observe that SAM2 exhibits a sparse perception pattern similar to biological vision, which provides opportunities to eliminate redundant computation and accelerate inference: i) In the mask decoder, attention primarily focuses on the foreground objects, whereas the image encoder in the earlier stage exhibits a broad attention span, which results in unnecessary computation on background regions. ii) In the memory bank, only a small subset of tokens in each frame contributes significantly to memory attention, and the salient regions exhibit temporal consistency, making full-token computation redundant. With these insights, we propose Efficient-SAM2, which encourages SAM2 to adaptively focus on object regions while eliminating task-irrelevant computations, thereby significantly improving inference efficiency. Specifically, for the image encoder, we propose object-aware Sparse Window Routing (SWR), a window-level computation allocation mechanism that leverages the consistency and saliency cues from the previous-frame decoder to route background regions into a lightweight shortcut branch. Moreover, for memory attention, we propose object-aware Sparse Memory Retrieval (SMR), which allows only the salient memory tokens in each frame to participate in computation, with the saliency pattern reused from their first recollection. With negligible additional parameters and minimal training overhead, Efficient-SAM2 delivers a 1.68× speedup on the SAM2.1-L model with only a 1.0% accuracy drop on the SA-V test set.
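The memory-side idea, keeping only salient tokens per memory frame and reusing the pattern found at the frame's first recollection, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's method: the class `SparseMemoryBank`, the helper `top_salient_indices`, and the caller-supplied `score_fn` (standing in for whatever saliency the attention layer produces) are all hypothetical names.

```python
from collections import deque
from typing import Callable, Deque, List, Optional, Sequence


def top_salient_indices(saliency: Sequence[float], sparsity: float) -> List[int]:
    """Return sorted indices of the tokens kept at the given sparsity."""
    keep = max(1, round(len(saliency) * (1.0 - sparsity)))
    ranked = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    return sorted(ranked[:keep])


class SparseMemoryBank:
    """FIFO memory of frames with one cached saliency mask per frame."""

    def __init__(self, max_frames: int, sparsity: float) -> None:
        self.sparsity = sparsity
        # Both deques share maxlen, so eviction keeps frames and masks aligned.
        self.frames: Deque[list] = deque(maxlen=max_frames)
        self.masks: Deque[Optional[List[int]]] = deque(maxlen=max_frames)

    def add_frame(self, tokens: list) -> None:
        self.frames.append(tokens)
        self.masks.append(None)  # mask is computed lazily on first recollection

    def retrieve(self, score_fn: Callable[[list], Sequence[float]]) -> List[list]:
        """Return the kept tokens of every stored frame, oldest first."""
        selected = []
        for slot, tokens in enumerate(self.frames):
            if self.masks[slot] is None:
                # First recollection: score all tokens once and cache the mask.
                self.masks[slot] = top_salient_indices(score_fn(tokens), self.sparsity)
            # Later recollections reuse the cached mask without rescoring.
            selected.append([tokens[i] for i in self.masks[slot]])
        return selected
```

Because the mask is cached per frame, the scoring function runs only once for each memory frame; every later query attends to the reduced token set. Whether the cached pattern stays valid depends on the temporal consistency the abstract describes.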

Paper Structure (17 sections, 16 equations, 7 figures, 7 tables)

This paper contains 17 sections, 16 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: The image encoder exhibits broad attention coverage, but the mask decoder focuses narrowly on prompt-relevant objects.
  • Figure 2: In memory attention, the memory frame exhibits a concentrated attention distribution, suggesting redundancy in the memory bank, and its saliency pattern remains temporally consistent, as evidenced by high cosine similarity (CS) to its first recollection.
  • Figure 3: Overview of Efficient-SAM2. For the image encoder, we introduce object-aware Sparse Window Routing (SWR), which assigns object-irrelevant background windows to a lightweight shortcut branch based on the spatial-temporal consistency and perceptual saliency of the object, thus reducing encoding redundancy. For memory attention, we propose object-aware Sparse Memory Retrieval (SMR), which builds a FIFO mask queue to retrieve the most salient memory tokens, with the saliency patterns reused from their first recollection, thereby reducing the computational cost.
  • Figure 4: Detailed speedup analysis. Our method achieves a well-balanced accuracy–speed trade-off.
  • Figure 5: Sparsity analysis of SWR and SMR. As $\tau$ grows, the window sparsity decreases but stays above 0.6, while the accuracy steadily increases. As the memory sparsity increases, the performance surprisingly surpasses the baseline and then declines, with $s=0.95$ offering a satisfactory trade-off.
  • ...and 2 more figures
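
The cosine similarity (CS) mentioned in the Figure 2 caption compares a memory frame's current saliency pattern with the one from its first recollection. A minimal sketch of that measure, assuming each saliency pattern is flattened into a list of per-token scores (the function name and the zero-vector convention are choices made here, not details from the paper):

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equally long saliency vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Convention for this sketch: an all-zero vector yields similarity 0.0.
    return dot / norm if norm else 0.0
```

A value close to 1 means the two patterns emphasize the same tokens up to a scale factor, which is the property that makes reusing a cached pattern plausible.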