SEG-SAM: Semantic-Guided SAM for Unified Medical Image Segmentation

Shuangping Huang, Hao Liang, Qingfeng Wang, Chulong Zhong, Zijian Zhou, Miaojing Shi

TL;DR

SEG-SAM addresses the challenge of unified semantic medical segmentation by coupling SAM's strong binary segmentation with a dedicated Semantic-Aware Decoder and a text-driven language-vision module. The semantic decoder provides semantic masks for prompted objects while classifying unprompted objects, and the text-to-vision enhancement injects domain knowledge from large language models into the segmentation process. A cross-mask spatial alignment loss ensures consistency between binary and semantic predictions, boosting both tasks. Across Med2D-16M and cross-dataset benchmarks, SEG-SAM achieves state-of-the-art performance for binary segmentation and superior semantic segmentation, demonstrating the practical value of language-informed semantic priors in clinical imaging.

Abstract

Recently, developing unified medical image segmentation models has gained increasing attention, especially with the advent of the Segment Anything Model (SAM). SAM has shown promising binary segmentation performance in natural domains; however, transferring it to the medical domain remains challenging, as medical images often possess substantial inter-category overlaps. To address this, we propose the SEmantic-Guided SAM (SEG-SAM), a unified medical segmentation model that incorporates semantic medical knowledge to enhance medical segmentation performance. First, to avoid potential conflict between binary and semantic predictions, we introduce a semantic-aware decoder, independent of SAM's original decoder, that is specialized for both semantic segmentation of the prompted object and classification of unprompted objects in the image. Second, to further enhance the model's semantic understanding, we solicit key characteristics of medical categories from large language models and incorporate them into SEG-SAM through a text-to-vision semantic module, which adaptively transfers the language information into the visual segmentation task. Finally, we introduce a cross-mask spatial alignment strategy that encourages greater overlap between the masks predicted by SEG-SAM's two decoders, thereby benefiting both predictions. Extensive experiments demonstrate that SEG-SAM outperforms state-of-the-art SAM-based methods in unified binary medical segmentation and task-specific methods in semantic medical segmentation, showing promising results and potential for broader medical applications.
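
The cross-mask spatial alignment idea can be pictured as an overlap penalty between the masks from the two decoders. The sketch below uses a soft Dice term, which is an assumption made for illustration: the abstract does not state the actual form of the loss. The function name `soft_dice_overlap_loss` and the flattened-list inputs are hypothetical.

```python
from typing import Sequence


def soft_dice_overlap_loss(
    binary_probs: Sequence[float],
    semantic_probs: Sequence[float],
    eps: float = 1e-6,
) -> float:
    """Return 1 - soft Dice between two flattened probability masks.

    Both inputs hold per-pixel foreground probabilities in [0, 1] for the
    same prompted object: one from the binary decoder, one from the
    semantic decoder. A soft Dice term is an illustrative assumption,
    not the paper's stated loss.
    """
    if len(binary_probs) != len(semantic_probs):
        raise ValueError("masks must have the same number of pixels")

    # Soft intersection: product of per-pixel probabilities.
    intersection = sum(b * s for b, s in zip(binary_probs, semantic_probs))
    # Sum of the two soft mask areas.
    total = sum(binary_probs) + sum(semantic_probs)

    # eps keeps the ratio defined when both masks are empty.
    dice = (2.0 * intersection + eps) / (total + eps)
    return 1.0 - dice


# Identical masks overlap fully, so the penalty vanishes.
same = soft_dice_overlap_loss([1.0, 0.0, 1.0, 0.0], [1.0, 0.0, 1.0, 0.0])

# Disjoint masks do not overlap, so the penalty is close to 1.
disjoint = soft_dice_overlap_loss([1.0, 0.0], [0.0, 1.0])
```

Minimizing a term of this kind pushes the two predicted masks toward the same spatial support, which is the behavior the abstract attributes to the alignment strategy.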

Paper Structure

This paper contains 16 sections, 11 equations, 3 figures, 8 tables.

Figures (3)

  • Figure 1: The comparison of our SEG-SAM with task-specific and SAM-based methods for medical image segmentation. In the former, each image modality is processed by a separate model. In the latter, all image modalities are handled in a unified manner, but the model can only produce binary masks. Distinct from both, our method leverages semantic learning and text knowledge to achieve unified semantic mask prediction.
  • Figure 2: Overview of the SEG-SAM framework. Given an image $I$ with the prompted object $O_p$ from visual prompts $V_p$, SAM's image encoder first extracts visual embeddings $f_v$, while the prompt encoder encodes $V_p$ into prompt tokens $t_p$. Next, SAM's original decoder uses the original tokens $t_o$ to predict binary masks $\hat{M}_b$ for $O_p$. Then, our proposed semantic-aware decoder uses a segmentation-oriented token $t_{so}$ to predict the semantic mask $\hat{M}_s$ for $O_p$ and classification-oriented tokens $t_{co}$ to capture category information of unprompted objects. Lastly, medical text descriptions generated by a pre-trained LLM are incorporated into the text summary token $t_{text}$ through a text-to-vision semantic enhancement scheme, and $t_{text}$ is embedded into the prompt tokens $t_p$ to improve segmentation performance.
  • Figure 3: Qualitative comparisons with other methods on 8 modalities. We compare with U-Mamba and Med2D.