Promoting Segment Anything Model towards Highly Accurate Dichotomous Image Segmentation

Xianjie Liu, Keren Fu, Yao Jiang, Qijun Zhao

TL;DR

DIS-SAM tackles the challenge of converting SAM's robust zero-shot segmentation into highly accurate dichotomous segmentation by a two-stage refinement with IS-Net, retaining promptability. The method introduces GT enrichment and a composite loss with parameter orthogonalization to improve boundary precision. Experiments on DIS-5K and HQSeg-44k show substantial improvements in $F^{max}_\beta$ and related metrics over SAM, HQ-SAM, and Pi-SAM, with strong zero-shot generalization. The approach demonstrates how combining a foundation model with a specialized DIS network can deliver precise object boundaries while maintaining interactive capabilities.

Abstract

The Segment Anything Model (SAM) represents a significant breakthrough in foundation models for computer vision, providing a large-scale image segmentation model. However, despite SAM's strong zero-shot performance, its segmentation masks lack fine-grained details, particularly along object boundaries. It is therefore both interesting and valuable to explore whether SAM can be improved towards highly accurate object segmentation, which is known as the dichotomous image segmentation (DIS) task. To this end, we propose DIS-SAM, a framework tailored for highly accurate segmentation that maintains SAM's promptable design. DIS-SAM employs a two-stage approach, integrating SAM with a modified version of an advanced network previously designed for the prompt-free DIS task. To better train DIS-SAM, we employ a ground truth enrichment strategy that modifies the original mask annotations. Despite its simplicity, DIS-SAM outperforms SAM, HQ-SAM, and Pi-SAM by approximately 8.5%, 6.9%, and 3.7% in maximum F-measure, respectively. Our code is available at https://github.com/Tennine2077/DIS-SAM
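
For readers unfamiliar with the headline metric, the maximum F-measure ($F^{max}_\beta$) can be sketched as below. This is a minimal illustration, not the authors' evaluation code. It assumes $\beta^2 = 0.3$, a common choice in the salient object detection literature, and a sweep over evenly spaced binarization thresholds; neither setting is stated in this section, and the function name `max_f_measure` is hypothetical.

```python
import numpy as np

def max_f_measure(pred, gt, beta_sq=0.3, num_thresholds=255):
    """Return the best F-measure over a sweep of binarization thresholds.

    pred: float array in [0, 1] (predicted probability map).
    gt:   binary array of the same shape (ground-truth mask).
    beta_sq and num_thresholds are assumed defaults, not values from the paper.
    """
    gt = np.asarray(gt).astype(bool)
    pred = np.asarray(pred, dtype=float)
    best = 0.0
    # Thresholds lie in [0, 1); a pixel is foreground if pred > threshold.
    for t in np.linspace(0.0, 1.0, num_thresholds, endpoint=False):
        binary = pred > t
        tp = np.logical_and(binary, gt).sum()
        if tp == 0:
            # No true positives: precision and recall are both zero.
            continue
        precision = tp / binary.sum()
        recall = tp / gt.sum()
        f = (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)
        best = max(best, f)
    return float(best)
```

With $\beta^2 < 1$ the score weights precision more heavily than recall, so a mask that spills past the object boundary is penalized more than one that slightly under-covers the object.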

Paper Structure (15 sections, 3 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Overall pipeline of the proposed DIS-SAM.
  • Figure 2: An example of segmenting out connected components, where the GT image is decomposed into three parts, corresponding to three masks. For the sake of space, the original color image is omitted.
  • Figure 3: Visual results of DIS-SAM, HQ-SAM [ke2024segment], SAM [kirillov2023segment], and IS-Net [qin2022highly].
  • Figure 4: Visual results demonstrating the promptable capability of DIS-SAM.
  • Figure 5: Visual results of the ablation study. "Box" and "Mask" indicate whether the prompt box or SAM's mask, respectively, is concatenated to the input during the second stage.
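
The connected-component decomposition described in the Figure 2 caption can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes 8-connectivity, uses a plain breadth-first search (library routines such as `scipy.ndimage.label` provide the same labeling), and the helper names `split_connected_components` and `bounding_box` are hypothetical. Deriving a box from each component is likewise an assumption about how a per-component box prompt might be obtained.

```python
from collections import deque
import numpy as np

def split_connected_components(gt_mask):
    """Split a binary mask into one boolean mask per 8-connected component."""
    mask = np.asarray(gt_mask).astype(bool)
    h, w = mask.shape
    visited = np.zeros((h, w), dtype=bool)
    components = []
    for y in range(h):
        for x in range(w):
            if not mask[y, x] or visited[y, x]:
                continue
            # Breadth-first flood fill starting from an unvisited foreground pixel.
            comp = np.zeros((h, w), dtype=bool)
            queue = deque([(y, x)])
            visited[y, x] = True
            while queue:
                cy, cx = queue.popleft()
                comp[cy, cx] = True
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny, nx] and not visited[ny, nx]):
                            visited[ny, nx] = True
                            queue.append((ny, nx))
            components.append(comp)
    return components

def bounding_box(component):
    """Return (x_min, y_min, x_max, y_max) of a non-empty boolean mask."""
    ys, xs = np.nonzero(component)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```

Applied to a ground-truth image like the one in Figure 2, `split_connected_components` would return three masks, each of which could serve as a separate training target.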