FocSAM: Delving Deeply into Focused Objects in Segmenting Anything

You Huang, Zongyu Lan, Liujuan Cao, Xianming Lin, Shengchuan Zhang, Guannan Jiang, Rongrong Ji

TL;DR

FocSAM addresses instability in SAM-based interactive segmentation by introducing a per-object focus refiner that dynamically concentrates image embeddings on the target using Dynamic Window MSA and Pixel-wise Dynamic ReLU. The focus refiner, composed of plain and shift refine blocks, fuses initial interactive cues with object-centered embeddings to stabilize segmentation across successive clicks, while preserving SAM’s efficient preprocessing on CPU. Training combines NFL with a PTL loss in a two-stage process, and experiments across six datasets show FocSAM matching or exceeding state-of-the-art NoC metrics while cutting CPU inference time to approximately 5.6% of the prior best method's. The results point to practical gains for large-scale annotation and real-time interactive segmentation, with improved stability, speed, and scalability on CPU hardware.

Abstract

The Segment Anything Model (SAM) marks a notable milestone in segmentation models, highlighted by its robust zero-shot capabilities and ability to handle diverse prompts. SAM follows a pipeline that separates interactive segmentation into image preprocessing through a large encoder and interactive inference via a lightweight decoder, ensuring efficient real-time performance. However, under this pipeline SAM faces stability issues on challenging samples. These issues arise from two main factors. First, because image preprocessing happens once before interaction, SAM cannot dynamically use image-level zoom-in strategies to refocus on the target object during interaction. Second, the lightweight decoder struggles to sufficiently integrate interactive information with image embeddings. To address these two limitations, we propose FocSAM with a pipeline redesigned on two pivotal aspects. First, we propose Dynamic Window Multi-head Self-Attention (Dwin-MSA) to dynamically refocus SAM's image embeddings on the target object. Dwin-MSA localizes attention computations around the target object, enhancing object-related embeddings with minimal computational overhead. Second, we propose Pixel-wise Dynamic ReLU (P-DyReLU) to enable sufficient integration of interactive information from a few initial clicks that have significant impacts on the overall segmentation results. Experimentally, FocSAM augments SAM's interactive segmentation performance to match the existing state-of-the-art method in segmentation quality, requiring only about 5.6% of this method's inference time on CPUs.
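To make the idea of localizing attention around the target object more concrete, the following is a minimal sketch and not the paper's Dwin-MSA implementation. It assumes the window is simply the tight bounding box of the previous mask at embedding resolution, uses a single head with identity query/key/value projections, and adds the result back as a residual. The names `mask_bbox` and `windowed_self_attention` are hypothetical.

```python
import numpy as np

def mask_bbox(mask):
    """Return (top, bottom, left, right) of the tight box around True cells."""
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    if rows.size == 0:
        raise ValueError("mask is empty")
    return int(rows[0]), int(rows[-1]) + 1, int(cols[0]), int(cols[-1]) + 1

def windowed_self_attention(emb, mask):
    """Self-attention restricted to the window around the mask.

    emb:  (H, W, C) float image embedding.
    mask: (H, W) boolean mask at embedding resolution.
    Assumption: identity Q/K/V projections and one head, for brevity.
    """
    top, bottom, left, right = mask_bbox(mask)
    window = emb[top:bottom, left:right]
    h, w, c = window.shape
    tokens = window.reshape(h * w, c)

    # Scaled dot-product attention over window tokens only.
    scores = tokens @ tokens.T / np.sqrt(c)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    attended = weights @ tokens

    # Residual update inside the window; everything outside is untouched.
    out = emb.copy()
    out[top:bottom, left:right] = window + attended.reshape(h, w, c)
    return out

# Toy usage with random data (illustrative values only).
rng = np.random.default_rng(0)
emb = rng.standard_normal((8, 8, 4))
mask = np.zeros((8, 8), dtype=bool)
mask[2:5, 3:6] = True
refined = windowed_self_attention(emb, mask)
```

In this toy case attention runs over the 9 tokens inside the window rather than all 64 positions, which illustrates why restricting attention to a region around the object can keep the added cost small.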

Paper Structure (25 sections, 10 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: Interactive segmentation stability on a challenging example. The bottom-left shows the example overlaid with GT (purple masks). The top and middle rows illustrate the interactive segmentation of SAM and the proposed FocSAM, where each click is placed at the center of erroneously predicted regions and categorized as either positive (green) or negative (red). SAM's performance is unstable in this example (top row), where the $9$th click yields an IoU of $88.59$ (left) but a subsequent click significantly reduces the IoU to $12.78$ (right). In contrast, FocSAM (middle row) shows consistent performance. The plot (bottom-right) summarizes the segmentation trends over 20 clicks, clearly contrasting SAM's IoU fluctuations with FocSAM's stable performance.
  • Figure 2: Overview of the proposed FocSAM, which builds upon SAM. SAM comprises an image encoder, a prompt encoder and a decoder. The image encoder transforms images into image embeddings before interaction. In each interaction with an object, the prompt encoder converts the previous mask and annotator clicks into mask and click embeddings, respectively. These three embeddings and a learnable query embedding are fed into the decoder for segmentation. Building on SAM's pipeline, FocSAM introduces a focus refiner that is employed once per object during interaction (Figure (a)). In an early step of SAM's interaction, this refiner processes SAM's image embeddings, the previous mask and the click-fused query embedding through a stack of refine blocks (Figure (b)). Each block receives the image and query embeddings, with the mask shared across all the blocks, and produces the image and query embeddings fed into the subsequent block. The final output is a refined image embedding, which replaces the original image embedding for subsequent interactions with the object.
  • Figure 3: Overview of FocSAM's focus refiner. Figure (a) depicts the overall architecture of the focus refiner. Figure (b) details the refine block, showing the flow of image and query embeddings through the Dwin and MSA modules. Figures (c) and (d) highlight the window selection within the Dwin module and the shift strategy. Figure (e) provides a detailed view of the MSA module.
  • Figure 4: Stability analysis of interactive segmentation. We report results on SBD [BharathHariharan2011SemanticCF], MVTec [bergmann2019mvtec] and COD10K [fan2020camouflaged], and show $\Delta$IoU for consecutive clicks, filtering out $\Delta$IoU greater than $-1\%$. The results highlight FocSAM's superior stability over SAM, evidenced by fewer significant declines in segmentation quality with additional clicks.
  • Figure 5: Qualitative analysis on a challenging example. The first image from the left displays the challenging example with the image and GT (blue masks). The top and bottom rows on the right respectively show the segmentation results of SAM and FocSAM at the $1^\text{st}$, $5^\text{th}$, $10^\text{th}$, and $20^\text{th}$ clicks. Clicks are indicated with green (positive) and red (negative) circles.
  • ...and 3 more figures
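
The stability measure described in the captions for Figures 1 and 4 (IoU per click, and the change in IoU between consecutive clicks with small changes filtered out) can be sketched as follows. This is an illustrative reading, not the authors' evaluation code: the helper names `iou` and `iou_drops` are hypothetical, IoU values are assumed to be in percent, and the $-1\%$ threshold is assumed to mean percentage points. The example series is made up except for the pair 88.59 and 12.78, which comes from the Figure 1 caption.

```python
import numpy as np

def iou(pred, gt):
    """Intersection over union of two binary masks, as a fraction in [0, 1]."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return float(np.logical_and(pred, gt).sum() / union)

def iou_drops(ious, threshold=-1.0):
    """Return (click_number, delta) for consecutive-click changes <= threshold.

    ious[0] is the IoU (in percent) after click 1, so the change computed
    from ious[i] to ious[i + 1] is attributed to click number i + 2.
    """
    deltas = np.diff(np.asarray(ious, dtype=float))
    return [(i + 2, float(d)) for i, d in enumerate(deltas) if d <= threshold]

# IoU of two tiny masks: intersection 1, union 2.
example_iou = iou([[1, 1], [0, 0]], [[1, 0], [0, 0]])

# Hypothetical per-click IoU series (percent); only 88.59 -> 12.78 is from Figure 1.
series = [60.0, 75.0, 88.59, 12.78, 80.0]
drops = iou_drops(series)
```

A stable method would produce few or no entries in `drops`; here the only entry is the large decline at the fourth click of the made-up series.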