Towards Fine-grained Interactive Segmentation in Images and Videos
Yuan Yao, Qiushi Yang, Miaomiao Cui, Liefeng Bo
TL;DR
This work tackles fine-grained interactive segmentation with SAM-based models by introducing SAM2Refiner, an architecture built on SAM2 that adds three modules, Localization Augment, Prompt Retargeting, and Mask Refinement, to produce sharp, detail-rich masks for both images and videos. Localization Augment enriches global representations with local cues through cross-attention over multi-scale features. Prompt Retargeting aligns prompts with the enhanced object embedding to improve responsiveness in intricate regions. Mask Refinement fuses multi-scale encoder features with the object embedding to deliver high-resolution masks. Trained on HQSeg-44K with the SAM2 backbone kept fixed, SAM2Refiner shows strong gains on image benchmarks (DIS, COIFT, HRSOD, ThinObject) and state-of-the-art performance on video segmentation tasks, which validates its compatibility with the SAM2 video streaming pipeline. Together, the three modules yield segmentation that is globally coherent and locally detailed without sacrificing zero-shot capability or prompting stability.
Abstract
The recent Segment Anything Models (SAMs) have emerged as foundational visual models for general interactive segmentation. Despite their robust generalization, their performance degrades in scenarios that demand accurate masks. Existing methods for high-precision interactive segmentation face a trade-off between perceiving intricate local details and maintaining stable prompting capability, which limits the applicability and effectiveness of foundational segmentation models. To this end, we present SAM2Refiner, a framework built upon the SAM2 backbone. This architecture allows SAM2 to generate fine-grained segmentation masks for both images and videos while preserving its inherent strengths. Specifically, we design a localization augment module, which incorporates local contextual cues to enhance global features via a cross-attention mechanism, thereby exploiting potential detailed patterns while maintaining semantic information. Moreover, to strengthen the prompting ability toward the enhanced object embedding, we introduce a prompt retargeting module that renews the embedding with spatially aligned prompt features. In addition, to obtain accurate high-resolution segmentation masks, we devise a mask refinement module that employs a multi-scale cascaded structure to fuse mask features with hierarchical representations from the encoder. Extensive experiments demonstrate the effectiveness of our approach, showing that the proposed method produces highly precise masks for both images and videos and surpasses state-of-the-art methods.
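To make the cross-attention idea behind the localization augment module concrete, the following is a minimal sketch and not the authors' implementation. It assumes that global features act as queries and local features act as keys and values, omits the learned projection matrices a real attention layer would have, and adds the attended local context back to the global tokens through a residual connection so the original semantic content is retained. The names cross_attention, global_tokens, and local_tokens are hypothetical, and plain Python is used only to keep the example dependency-free.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(global_tokens, local_tokens):
    """Enhance each global token with an attention-weighted mix of local tokens.

    Assumption: queries come from global features, keys and values from local
    features. Learned projections are omitted for brevity.
    """
    d = len(global_tokens[0])
    scale = 1.0 / math.sqrt(d)
    enhanced, weights = [], []
    for q in global_tokens:
        # Scaled dot-product similarity between this query and every local token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in local_tokens]
        w = softmax(scores)
        # Weighted sum of local tokens (used here as the values).
        context = [sum(wj * v[c] for wj, v in zip(w, local_tokens)) for c in range(d)]
        # Residual connection: keep the global token and add the local context.
        enhanced.append([qi + ci for qi, ci in zip(q, context)])
        weights.append(w)
    return enhanced, weights


# Toy example: 2 global tokens and 3 local tokens, feature dimension 2.
global_tokens = [[1.0, 0.0], [0.0, 1.0]]
local_tokens = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
enhanced, weights = cross_attention(global_tokens, local_tokens)
```

In this toy setup each global token places the most attention on the local token that points in the same direction, which illustrates how local cues can be routed to the global features they are most relevant to.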
