Towards Fine-grained Interactive Segmentation in Images and Videos

Yuan Yao, Qiushi Yang, Miaomiao Cui, Liefeng Bo

TL;DR

This work tackles the challenge of fine-grained interactive segmentation with SAM-based models by introducing SAM2Refiner, a SAM2-based architecture that adds Localization Augment, Prompt Retargeting, and Mask Refinement to produce sharp, detail-rich masks for both images and videos. Localization Augment enriches global representations with local cues through cross-attention over multi-scale features, while Prompt Retargeting aligns prompts with an enhanced object embedding to boost responsiveness in intricate regions; Mask Refinement fuses multi-scale encoder features with the object embedding to deliver high-resolution masks. Trained on HQSeg-44K with the SAM2 weights kept fixed, SAM2Refiner demonstrates strong gains on image benchmarks (DIS, COIFT, HRSOD, ThinObject) and state-of-the-art performance on video segmentation tasks, validating its compatibility with the SAM2 video streaming pipeline. Together, the three modules enable segmentation that is globally coherent yet locally detailed, without sacrificing zero-shot capabilities or prompting stability.

Abstract

The recent Segment Anything Models (SAMs) have emerged as foundational visual models for general interactive segmentation. Despite demonstrating robust generalization abilities, they still suffer performance degradation in scenarios demanding accurate masks. Existing methods for high-precision interactive segmentation face a trade-off between perceiving intricate local details and maintaining stable prompting capability, which hinders the applicability and effectiveness of foundational segmentation models. To this end, we present SAM2Refiner, a framework built upon the SAM2 backbone. This architecture allows SAM2 to generate fine-grained segmentation masks for both images and videos while preserving its inherent strengths. Specifically, we design a localization augment module, which incorporates local contextual cues to enhance global features via a cross-attention mechanism, thereby capturing detailed patterns while maintaining semantic information. Moreover, to strengthen the prompting ability toward the enhanced object embedding, we introduce a prompt retargeting module that renews the embedding with spatially aligned prompt features. In addition, to obtain accurate high-resolution segmentation masks, we devise a mask refinement module that employs a multi-scale cascaded structure to fuse mask features with hierarchical representations from the encoder. Extensive experiments demonstrate the effectiveness of our approach, showing that the proposed method produces highly precise masks for both images and videos and surpasses state-of-the-art methods.
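
To make the cross-attention idea behind the localization augment module more concrete, the following is a minimal NumPy sketch of generic single-head cross-attention in which global tokens act as queries and local tokens supply keys and values. It is not the authors' implementation: the function name `cross_attention_augment`, the projection matrices `w_q`, `w_k`, `w_v`, the token counts, the feature width, and the residual connection are all assumptions chosen for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention_augment(global_tokens, local_tokens, w_q, w_k, w_v):
    """Hypothetical sketch: enrich global tokens with local cues.

    global_tokens: (n_global, d) features carrying semantic context.
    local_tokens:  (n_local, d) features carrying fine detail.
    Returns the augmented global tokens and the attention weights.
    """
    q = global_tokens @ w_q            # queries come from the global features
    k = local_tokens @ w_k             # keys come from the local features
    v = local_tokens @ w_v             # values come from the local features
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)    # each global token attends over local tokens
    # Assumed residual connection: keep the original global (semantic) features
    # and add the attended local detail on top.
    return global_tokens + attn @ v, attn


rng = np.random.default_rng(0)
d = 8                                           # illustrative feature width
global_tokens = rng.normal(size=(4, d))         # e.g. 4 coarse tokens
local_tokens = rng.normal(size=(16, d))         # e.g. 16 fine tokens
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

augmented, attn = cross_attention_augment(global_tokens, local_tokens, w_q, w_k, w_v)
print(augmented.shape, attn.shape)
```

In this sketch the output keeps the shape of the global tokens, so a block of this kind could in principle be dropped in front of a decoder that expects the original global features; whether SAM2Refiner does so in this exact form is not stated in the abstract.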

Paper Structure

This paper contains 16 sections, 3 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: An overview of the proposed SAM2Refiner framework. It contains a localization augment module to balance detailed and semantic representations, a prompt retargeting module to enhance the response to input prompts, and a mask refinement structure to boost the quality of the output masks. The blue line denotes the SAM2 pipeline, and the black line denotes our SAM2Refiner pipeline.
  • Figure 2: A comparison between SAM2 and SAM2Refiner. SAM2Refiner introduces three modules for fine-grained interactive segmentation, which can be seamlessly embedded into SAM2's video streaming pipeline to support both images and videos.
  • Figure 3: Qualitative comparison with previous methods. Given the blue box as the visual prompt, our proposed SAM2Refiner produces more accurate results with correct structure and clear boundaries. Zoom in for better visualization.
  • Figure 4: Effectiveness of the LA module. LA yields a remarkable improvement in local details, although it displays some flickering boundaries.
  • Figure 5: Effectiveness of the PR module. PR allows a precise response to point prompts in ambiguous regions. Green and red points denote positive and negative prompts, respectively.