Vision-Language Enhanced Foundation Model for Semi-supervised Medical Image Segmentation

Jiaqi Guo, Mingzhen Li, Hanyu Su, Santiago López, Lexiaozi Fan, Daniel Kim, Aggelos Katsaggelos

TL;DR

The paper addresses label-scarce medical image segmentation by combining vision-language foundation modeling with semi-supervised learning. It introduces VESSA, a two-stage framework in which a VLM-based segmentation foundation model uses a template bank and memory to generate robust pseudo-labels, which are then used in a standard SSL pipeline and refined through mutual learning with the teacher. Stage 1 pre-trains VESSA on seven medical datasets with reference-driven, template-guided prompting and a memory-augmented decoder; Stage 2 integrates VESSA into UniMatch v2, enabling dynamic interaction in which VESSA guides pseudo-labels early on and the teacher improves over time. On ACDC and AbdomenCT-1K, VESSA-enhanced SSL consistently outperforms state-of-the-art baselines with 1–5% labeled data, indicating improved label efficiency and cross-domain robustness in medical imaging.

Abstract

Semi-supervised learning (SSL) has emerged as an effective paradigm for medical image segmentation, reducing the reliance on extensive expert annotations. Meanwhile, vision-language models (VLMs) have demonstrated strong generalization and few-shot capabilities across diverse visual domains. In this work, we integrate VLM-based segmentation into semi-supervised medical image segmentation by introducing a Vision-Language Enhanced Semi-supervised Segmentation Assistant (VESSA) that incorporates foundation-level visual-semantic understanding into SSL frameworks. Our approach consists of two stages. In Stage 1, the VLM-enhanced segmentation foundation model VESSA is trained as a reference-guided segmentation assistant using a template bank containing gold-standard exemplars, simulating learning from limited labeled data. Given an input-template pair, VESSA performs visual feature matching to extract representative semantic and spatial cues from exemplar segmentations, generating structured prompts for a SAM2-inspired mask decoder to produce segmentation masks. In Stage 2, VESSA is integrated into a state-of-the-art SSL framework, enabling dynamic interaction with the student model: as student predictions become more refined, they are fed back to VESSA as prompts, allowing it to generate higher-quality pseudo-labels and stronger guidance. Extensive experiments across multiple segmentation datasets and domains show that VESSA-augmented SSL significantly enhances segmentation accuracy, outperforming state-of-the-art baselines under extremely limited annotation conditions.
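The Stage 2 interaction described above, where VESSA's guidance dominates early and the SSL model's own predictions take over later, can be illustrated with a small sketch. This is not the paper's rule, which the summary does not specify. It assumes a hypothetical linear decay schedule and a simple per-pixel blend of foreground probabilities; the names `vessa_weight` and `blend_pseudo_label` and the 0.5 threshold are all illustrative assumptions.

```python
# Hypothetical sketch: the schedule, blend rule, and threshold are assumptions,
# not the method described in the paper.

def vessa_weight(step, total_steps, start=1.0, end=0.0):
    """Linearly decay the weight given to VESSA's prediction over training."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so out-of-range steps stay well defined.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * progress


def blend_pseudo_label(vessa_probs, teacher_probs, weight):
    """Blend per-pixel foreground probabilities, then threshold at 0.5.

    Both inputs are nested lists (rows of floats in [0, 1]) of equal shape.
    """
    blended = [
        [weight * v + (1.0 - weight) * t for v, t in zip(v_row, t_row)]
        for v_row, t_row in zip(vessa_probs, teacher_probs)
    ]
    return [[1 if p >= 0.5 else 0 for p in row] for row in blended]


if __name__ == "__main__":
    vessa = [[0.9, 0.1]]
    teacher = [[0.2, 0.8]]
    for step in (0, 50, 100):
        w = vessa_weight(step, total_steps=100)
        print(step, w, blend_pseudo_label(vessa, teacher, w))
```

With a weight of 1.0 the pseudo-label follows VESSA entirely, and with a weight of 0.0 it follows the teacher. Any monotone schedule would serve the same illustrative purpose.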

Paper Structure

This paper contains 21 sections, 9 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Comparison of segmentation performance (Dice score) between our method and two recent state-of-the-art approaches on the ACDC dataset under different labeled data ratios.
  • Figure 2: Overview of our model. (a) VESSA: During training, an input image, its matched template overlay, and a reference text describing the image–template pair are fed into the VLM (Qwen3-VL). The generated text containing the <SEG> token is mapped by an MLP to a prompt embedding for the prompt encoder of the segmentation foundation model, while the matched template and its annotations are stored in the model’s memory bank for mask prediction. (b) Specialist Model: A student–teacher network applies weak and strong augmentations to enforce consistency on each input image, while VESSA provides additional pseudo-label supervision. In later training stages, student predictions are fed back to VESSA to refine its pseudo-labels by supplying spatial and localization cues to VESSA’s mask decoder.
  • Figure 3: Three-component prompts sent to VESSA: From left to right, the target image, the text prompt, and a template sample used to guide the segmentation. The templates are automatically selected through a matching mechanism. The two images on the right show the output segmentation (binary mask & overlay).
  • Figure 4: Comparison of the qualitative results on ACDC (1% labeled data setting) and AbdomenCT-1K (5% labeled data setting).
  • Figure 5: Comparison of the qualitative results on all segmentation classes of ACDC (1% labeled data setting).
  • ...and 1 more figure
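
Figure 1 reports segmentation performance as a Dice score. For readers unfamiliar with the metric, here is a minimal sketch of the standard definition for binary masks, Dice = 2|P ∩ T| / (|P| + |T|). The function name `dice_score`, the nested-list mask format, and the convention of returning 1.0 when both masks are empty are assumptions for illustration; the paper's own evaluation code may differ, for example in how it averages across classes.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as nested lists of 0/1."""
    intersection = 0
    pred_sum = 0
    target_sum = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            intersection += p * t  # counts pixels set in both masks
            pred_sum += p
            target_sum += t
    if pred_sum + target_sum == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    return 2.0 * intersection / (pred_sum + target_sum)


if __name__ == "__main__":
    prediction = [[1, 1], [0, 0]]
    ground_truth = [[1, 0], [0, 0]]
    # One overlapping pixel, 2 predicted and 1 true: 2 * 1 / (2 + 1)
    print(dice_score(prediction, ground_truth))
```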