Det-SAM2: Technical Report on the Self-Prompting Segmentation Framework Based on Segment Anything Model 2

Zhiting Wang, Qiangong Zhou, Zongyang Liu

TL;DR

Det-SAM2 addresses the need for fully automated video object segmentation by removing manual prompts from SAM2 through a detection-driven prompting strategy based on YOLOv8. The framework combines a detection-driven prompt source, a SAM2-based video predictor with memory-augmented propagation, and post-processing to enable long, continuous video inference with a constant memory footprint. Key contributions include: (1) automatic per-frame prompting, (2) cumulative and limited propagation strategies to reduce compute, (3) a preloadable offline Memory Bank for transfer across videos, (4) online addition of new object IDs without memory resets, and (5) GPU/CPU memory optimizations to sustain constant VRAM usage. The approach is validated with a billiards scenario, showing SAM2-level segmentation quality and practical applicability for automated decision-making in real-time streams, with potential extension to other long-video tasks.
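The cumulative and limited propagation strategies and the constant memory footprint mentioned above can be illustrated with a small scheduling sketch. This is not the authors' implementation: the function name `schedule_propagation` and the parameters `accumulate_every`, `max_window`, and `max_kept` are hypothetical, chosen only to show how propagating once per batch of frames, over a bounded window, with old frames evicted, keeps both compute and memory bounded.

```python
from collections import deque


def schedule_propagation(num_frames, accumulate_every=4, max_window=8, max_kept=16):
    """Plan propagation calls for a stream of `num_frames` frames.

    All parameter names are illustrative assumptions, not Det-SAM2 settings:
      accumulate_every: propagate only once per this many new frames (cumulative).
      max_window:       propagate over at most this many recent frames (limited).
      max_kept:         keep at most this many frames in memory (bounded footprint).
    Returns (plan, kept_count), where plan is a list of (trigger_frame, window).
    """
    kept = deque(maxlen=max_kept)  # oldest frames are evicted automatically
    plan = []
    for idx in range(num_frames):
        kept.append(idx)
        # Cumulative strategy: wait until a batch of frames has accumulated.
        if (idx + 1) % accumulate_every == 0:
            # Limited strategy: only revisit the most recent frames.
            window = list(kept)[-max_window:]
            plan.append((idx, window))
    return plan, len(kept)


if __name__ == "__main__":
    plan, kept_count = schedule_propagation(20)
    for trigger, window in plan:
        print(f"after frame {trigger}: propagate over frames {window[0]}..{window[-1]}")
    print("frames held in memory:", kept_count)
```

With these illustrative defaults, the number of frames held never exceeds `max_kept` regardless of stream length, which is the property the constant-memory claim depends on.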

Abstract

Segment Anything Model 2 (SAM2) demonstrates exceptional performance in video segmentation and refinement of segmentation results. We anticipate that it can further evolve to achieve higher levels of automation for practical applications. Building upon SAM2, we carried out a series of experiments that ultimately led to the development of a fully automated pipeline, termed Det-SAM2, in which object prompts are automatically generated by a detection model to facilitate inference and refinement by SAM2. This pipeline enables inference on infinitely long video streams with constant VRAM and RAM usage, all while preserving the same efficiency and accuracy as the original SAM2. This technical report focuses on the construction of the overall Det-SAM2 framework and the subsequent engineering optimization applied to SAM2. We present a case demonstrating an application built on the Det-SAM2 framework: AI refereeing in a billiards scenario, derived from our business context. The project is available at https://github.com/motern88/Det-SAM2.

Paper Structure

This paper contains 20 sections, 12 figures.

Figures (12)

  • Figure 1: Overview of Det-SAM2 Tasks. The overall technical pipeline of Det-SAM2 comprises three key components: the detection module, the pixel-level video tracking module using SAM2 instances, and the post-processing module. The detection model provides initial (potentially imperfect) bounding boxes, which are used as conditional prompts for SAM2. The SAM2 video predictor propagates these discrete frame prompts (propagate_in_video) across all frames in the video, enabling continuous inference. Ultimately, the SAM2 video predictor outputs spatiotemporal masks for object instances throughout the video. The post-processing module then analyzes the obtained masks to deliver accurate and quantifiable results, thereby supporting higher-level applications such as an AI coach or AI referee in billiards scenarios.
  • Figure 2: Original Framework of SAM2. The video frame features are processed through Memory Attention, integrating information from the current frame with that in the Memory Bank, and then passed to the Mask Decoder, which uses the conditional prompts to generate the predicted masks. The Memory Bank is extracted by the Memory Decoder from the conditional frames. The Memory Decoder receives outputs not only from the Mask Decoder but also from the Image Encoder.
  • Figure 3: Det-SAM2 Experimental Demo Framework Diagram. The condition prompt for a given frame is automatically added, and the condition prompt is provided by the detection model (in this case, YOLOv8). The detection box results are used as the prompt input for the Prompt Encoder.
  • Figure 4: The Det-SAM2 framework automatically adds condition prompts for each frame. In contrast to Figure 3, where the detection model branch is active only in the initial frame of the video, it is now applied to every frame throughout the video.
  • Figure 5: Det-SAM2 Video Stream Processing Diagram. Each frame passes through the Detection Model as a condition frame for SAM2 (represented in green in the diagram), and then the propagation operation (depicted in yellow as "propagate in video") is applied to all previously processed video frames to enable the correction capability.
  • ...and 7 more figures
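
The three-stage flow described in the Figure 1 and Figure 5 captions (detection boxes as prompts, mask prediction, post-processing) can be sketched in plain Python. This is a toy illustration under stated assumptions: `detect`, `segment_from_box`, `centroid`, and `run_pipeline` are hypothetical names, the detector is a stub that returns a drifting box, and the "segmenter" simply fills the box, whereas the real SAM2 video predictor refines prompts into pixel-accurate masks and propagates them through its Memory Bank.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), half-open pixel bounds
Mask = List[List[int]]           # rows of 0/1 values


def detect(frame_idx: int) -> Dict[int, Box]:
    """Hypothetical detector stub: one object drifting right by 1 px per frame."""
    return {1: (frame_idx, 1, frame_idx + 2, 3)}


def segment_from_box(box: Box, height: int, width: int) -> Mask:
    """Stand-in for the segmentation stage: fills the prompted box with ones."""
    x0, y0, x1, y1 = box
    return [
        [1 if x0 <= x < x1 and y0 <= y < y1 else 0 for x in range(width)]
        for y in range(height)
    ]


def centroid(mask: Mask) -> Tuple[float, float]:
    """Post-processing example: centre of mass of the mask as (x, y)."""
    pts = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    n = len(pts)
    return (sum(x for x, _ in pts) / n, sum(y for _, y in pts) / n)


def run_pipeline(num_frames: int, height: int = 6, width: int = 8):
    """Every frame is a condition frame: detect, segment, then post-process."""
    tracks: Dict[int, List[Tuple[float, float]]] = {}
    for idx in range(num_frames):
        for obj_id, box in detect(idx).items():
            mask = segment_from_box(box, height, width)
            tracks.setdefault(obj_id, []).append(centroid(mask))
    return tracks


if __name__ == "__main__":
    print(run_pipeline(3))
```

The per-object centroid track stands in for the quantifiable results that the post-processing module hands to a higher-level application such as the billiards referee.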