BiSeg-SAM: Weakly-Supervised Post-Processing Framework for Boosting Binary Segmentation in Segment Anything Models

Encheng Su, Hu Cao, Alois Knoll

TL;DR

BiSeg-SAM addresses medical binary segmentation under limited pixel-level annotation by refining SAM outputs through a weakly supervised post-processing framework. It introduces an Adaptively Global-Local Module that fuses local CNN features with SAM, a WeakBox module that uses the MM2B transformation and a Scale Consistency loss to generate adaptive box prompts, and a DetailRefine module that sharpens boundaries using a small set of ground-truth (GT) examples. Experiments on five polyp datasets and ISIC show state-of-the-art performance, especially in multi-foreground and boundary-precise cases, while reducing annotation costs. The approach shows practical potential for medical image analysis and could be extended to other modalities and multi-modal fusion.

Abstract

Accurate segmentation of polyps and skin lesions is essential for diagnosing colorectal and skin cancers. While various segmentation methods for polyps and skin lesions using fully supervised deep learning techniques have been developed, the pixel-level annotation of medical images by doctors is both time-consuming and costly. Foundational vision models like the Segment Anything Model (SAM) have demonstrated superior performance; however, directly applying SAM to medical segmentation may not yield satisfactory results due to the lack of domain-specific medical knowledge. In this paper, we propose BiSeg-SAM, a SAM-guided weakly supervised prompting and boundary refinement network for the segmentation of polyps and skin lesions. Specifically, we fine-tune SAM combined with a CNN module to learn local features. We introduce a WeakBox with two functions: automatically generating box prompts for the SAM model and using our proposed Multi-choice Mask-to-Box (MM2B) transformation for rough mask-to-box conversion, addressing the mismatch between coarse labels and precise predictions. Additionally, we apply scale consistency (SC) loss for prediction scale alignment. Our DetailRefine module enhances boundary precision and segmentation accuracy by refining coarse predictions using a limited amount of ground truth labels. This comprehensive approach enables BiSeg-SAM to achieve excellent multi-task segmentation performance. Our method demonstrates significant superiority over state-of-the-art (SOTA) methods when tested on five polyp datasets and one skin cancer dataset.

Paper Structure

This paper contains 12 sections, 13 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: The pipeline of the proposed BiSeg-SAM. It includes: (1) a SAM and CNN Module that integrates local detail information into SAM; (2) an automatic box prompting mechanism via the WeakBox Module; and (3) the DetailRefine Module, which learns clear edge information and richer image features.
  • Figure 2: The architecture of the SAM and CNN Module. The image encoder from the SAM is frozen and integrated with a pre-trained CNN block. The output from the CNN block is concatenated with the frozen image encoder's output. The concatenated features are then passed through a gate before being fed into the mask decoder. Additionally, the bounding box generated during the MM2B block is used as a prompt for the B.
  • Figure 3: First, the center point is judged to determine whether it is in the foreground or background, clarifying the number of foreground objects. If it is a single foreground image, the bounding box is directly generated. If it is a multiple foreground image, the maximum and minimum values are used to generate a bounding box that encompasses all foregrounds.
  • Figure 4: Technical details of the WeakBox Module. The multi-scale information $P_1$ and $P_2$ are partially optimized through SC loss. They are then transformed into bounding boxes via MM2B. The transformed bounding boxes $T_1$ and $T_2$ are compared with the ground truth bounding box $B$ using BCE+Dice loss. The final loss of this module is obtained by combining these two losses.
  • Figure 5: The architecture of the DetailRefine Module. It includes convolutional layers (Conv), batch normalization (BN), ReLU activation, max pooling (MaxPool), and bilinear upsampling layers. The module refines segmentation by combining coarse predictions with residual corrections through element-wise addition.
  • ...and 1 more figure
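
The mask-to-box step described in the Figure 3 caption can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names (`mask_to_box`, `mm2b_sketch`, `box_to_mask`) are hypothetical, and the "center point" is assumed here to be the center of the box enclosing all foreground pixels, which the caption does not specify. The last helper rasterizes a box back into a binary mask, which is one plausible way a box could be compared against a ground-truth box with a pixel-wise loss such as BCE+Dice (Figure 4).

```python
import numpy as np


def mask_to_box(mask):
    """Return (x_min, y_min, x_max, y_max) enclosing all nonzero pixels, or None."""
    ys, xs = np.nonzero(mask)  # row indices, column indices of foreground pixels
    if ys.size == 0:
        return None  # empty mask: no box can be produced
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())


def mm2b_sketch(mask):
    """Hypothetical sketch of the Figure 3 logic.

    Returns (box, is_single), where `box` encloses every foreground pixel and
    `is_single` is True when the box center lies on the foreground.
    ASSUMPTION: the "center point" is the center of the enclosing box.
    """
    box = mask_to_box(mask)
    if box is None:
        return None, False
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) // 2, (y0 + y1) // 2
    # Center on foreground -> treated as one object; center on background ->
    # treated as several objects. In both cases the min/max box covers them all.
    is_single = bool(mask[cy, cx])
    return box, is_single


def box_to_mask(box, shape):
    """Rasterize an inclusive (x_min, y_min, x_max, y_max) box into a binary mask."""
    x0, y0, x1, y1 = box
    out = np.zeros(shape, dtype=np.uint8)
    out[y0:y1 + 1, x0:x1 + 1] = 1
    return out


if __name__ == "__main__":
    coarse = np.zeros((8, 8), dtype=np.uint8)
    coarse[1:3, 1:3] = 1  # first foreground blob
    coarse[5:7, 4:7] = 1  # second foreground blob
    box, is_single = mm2b_sketch(coarse)
    print(box, is_single)
```

In this sketch the single-foreground and multi-foreground branches yield the same min/max box; the flag is returned only so that a caller could treat the two cases differently, as the caption suggests the original method does.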