DuSSS: Dual Semantic Similarity-Supervised Vision-Language Model for Semi-Supervised Medical Image Segmentation

Qingtao Pan, Wenhao Qiao, Jingjiao Lou, Bing Ji, Shuo Li

TL;DR

DuSSS tackles semi-supervised medical image segmentation by leveraging a Vision-Language Model with uncertainty-aware cross- and intra-modal learning. It introduces Dual Contrastive Learning (DCL) to learn multiple semantic correspondences and Semantic Similarity-Supervision (SSS) to regulate similarity under distribution-based uncertainty. A text-guided SSMIS module refines pseudo-labels by grounding segmentation in textual prompts, improving consistency regularization. On QaTa-COV19, BM-Seg, and MoNuSeg, DuSSS achieves Dice scores of 82.52%, 74.61%, and 78.03% respectively, demonstrating strong performance gains and robustness to uncertainty.

Abstract

Semi-supervised medical image segmentation (SSMIS) uses consistency learning to regularize model training, which alleviates the burden of pixel-wise manual annotation. However, it often suffers from erroneous supervision caused by low-quality pseudo-labels. Vision-Language Models (VLMs) have great potential to enhance pseudo-labels by introducing text-prompt-guided multimodal supervision. They nevertheless face a cross-modal problem: the obtained messages tend to correspond to multiple targets. To address these problems, we propose a Dual Semantic Similarity-Supervised VLM (DuSSS) for SSMIS. Specifically, 1) Dual Contrastive Learning (DCL) is designed to improve cross-modal semantic consistency by capturing intrinsic representations within each modality and semantic correlations across modalities. 2) To encourage the learning of multiple semantic correspondences, a Semantic Similarity-Supervision (SSS) strategy is proposed and injected into each contrastive learning process in DCL, supervising semantic similarity via distribution-based uncertainty levels. Furthermore, a novel VLM-based SSMIS network is designed to compensate for the quality deficiencies of pseudo-labels. It uses the pretrained VLM to generate text-prompt-guided supervision, refining the pseudo-labels for better consistency regularization. Experimental results show that DuSSS achieves Dice scores of 82.52%, 74.61%, and 78.03% on three public datasets (QaTa-COV19, BM-Seg, and MoNuSeg, respectively).
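For readers unfamiliar with the metric reported above, the following is a minimal sketch of the standard Dice coefficient for binary masks. It illustrates the general definition, 2|A ∩ B| / (|A| + |B|), and is not taken from the paper's evaluation code; the function name `dice_score` and the convention of returning 1.0 for two empty masks are assumptions made for this example.

```python
import numpy as np


def dice_score(pred, target):
    """Dice coefficient between two binary masks of the same shape.

    Assumption: two empty masks count as a perfect match (1.0).
    """
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    if pred.shape != target.shape:
        raise ValueError("pred and target must have the same shape")

    # Number of pixels marked foreground in both masks.
    intersection = np.logical_and(pred, target).sum()
    # Total foreground pixels across the two masks.
    total = pred.sum() + target.sum()
    if total == 0:
        return 1.0
    return float(2.0 * intersection / total)


if __name__ == "__main__":
    prediction = [[1, 1], [0, 0]]
    ground_truth = [[1, 0], [0, 0]]
    print(f"Dice: {dice_score(prediction, ground_truth):.4f}")
```

A reported Dice of 82.52% corresponds to a value of 0.8252 on this scale, typically averaged over the test images.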

Paper Structure

This paper contains 18 sections, 18 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: The VLM has the potential to enhance pseudo-labels via a text-guided mask, improving consistency learning.
  • Figure 2: The limitations of current VLM methods and the solution of our DuSSS.
  • Figure 3: The framework of our DuSSS driven VLM for SSMIS. Step 1: Our DuSSS improves the ability of uncertainty understanding in VLM pre-training, thus enhancing the model's robustness for image-text alignment. Step 2: The text-guided SSMIS improves the quality of pseudo-labels for reliable semi-supervised consistency learning.
  • Figure 4: The visual superiority of the proposed method (DuSSS) on the QaTa-COV19, BM-Seg and MoNuSeg datasets. The proposed method shows high-quality segmentation, compared with semi-supervised and VLM-based methods.
  • Figure 5: Our DuSSS effectively addresses the uncertainty that results from the diverse correspondences between image and text. DuSSS activates the target region well for multiple similar yet lexically ambiguous texts, highlighting its robustness.
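
The captions above refer to image-text pairs with more than one valid correspondence. As a rough illustration of how a contrastive loss can accommodate that, here is a minimal NumPy sketch of a contrastive objective with soft similarity targets. It is not the paper's DCL or SSS formulation, whose equations are not reproduced on this page; the function names, the `smoothing` blend, the `prior` similarity matrix, and the temperature value are all assumptions chosen for the example.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def soft_targets(prior, smoothing=0.1):
    """Hypothetical soft labels: mostly one-hot, partly a prior similarity.

    `prior` is an (n, n) matrix of similarity scores between samples;
    each returned row sums to 1.
    """
    n = prior.shape[0]
    prior_probs = np.exp(log_softmax(prior))
    return (1.0 - smoothing) * np.eye(n) + smoothing * prior_probs


def soft_contrastive_loss(a, b, targets, temperature=0.07):
    """Cross-entropy between softmax(cos(a, b) / T) and soft target rows.

    With identity targets this reduces to a one-directional InfoNCE loss.
    """
    logits = l2_normalize(a) @ l2_normalize(b).T / temperature
    return float(-(targets * log_softmax(logits)).sum(axis=1).mean())


if __name__ == "__main__":
    # Toy embeddings: four orthogonal "image" and "text" vectors.
    image_emb = np.eye(4)
    text_emb = np.eye(4)
    targets = soft_targets(prior=np.zeros((4, 4)), smoothing=0.1)

    # Cross-modal loss in both directions (image-to-text and text-to-image).
    loss = 0.5 * (
        soft_contrastive_loss(image_emb, text_emb, targets)
        + soft_contrastive_loss(text_emb, image_emb, targets.T)
    )
    print(f"symmetric soft contrastive loss: {loss:.4f}")
```

The same loss function could be applied within a single modality (for example, between two augmented views of the same images), which is one plausible reading of the "intra-modal" half of a dual scheme; that pairing is likewise an assumption here.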