SimTxtSeg: Weakly-Supervised Medical Image Segmentation with Simple Text Cues

Yuxin Xie, Tao Zhou, Yi Zhou, Geng Chen

TL;DR

Medical image segmentation typically requires costly pixel-level annotations. This paper proposes SimTxtSeg, a weakly-supervised framework that uses simple text prompts to generate visual cues and pseudo-labels via the Segment Anything Model (SAM), combined with a Text-Vision Hybrid Attention decoder for cross-modal fusion. The approach comprises a Textual-to-Visual Cue Converter (TVCC) that translates text into visual prompts and pseudo-masks, and a Text-Vision Hybrid Attention (TVHA) module that fuses text and image features during segmentation. Empirical results on colonic polyp and MRI brain tumor datasets show pseudo-label quality approaching fully-supervised performance, with SimTxtSeg-w-TVHA achieving state-of-the-art results among weakly-supervised methods; sentence-level prompts and SAM variant choices influence outcomes. Overall, the method reduces annotation burden in medical imaging and demonstrates effective language-guided, cross-modal segmentation with potential for broader clinical impact and end-to-end integration with SAM.

Abstract

Weakly-supervised medical image segmentation is a challenging task that aims to reduce annotation cost while preserving segmentation performance. In this paper, we present a novel framework, SimTxtSeg, that leverages simple text cues to generate high-quality pseudo-labels and, at the same time, studies cross-modal fusion in training segmentation models. Our contribution consists of two key components: an effective Textual-to-Visual Cue Converter that produces visual prompts from text prompts on medical images, and a text-guided segmentation model with Text-Vision Hybrid Attention that fuses text and image features. We evaluate our framework on two medical image segmentation tasks, colonic polyp segmentation and MRI brain tumor segmentation, and achieve consistent state-of-the-art performance. Source code is available at: https://github.com/xyx1024/SimTxtSeg.
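
The two-stage flow described above (a text cue is converted to a visual prompt, the prompt drives SAM to produce a pseudo-mask, and the pseudo-masks become training labels) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: the types `Image`, `Mask`, `Box`, the class `PseudoLabeledSample`, the function `make_pseudo_labels`, and the `detect`/`segment` callables are all hypothetical names. The toy stand-ins use simple thresholding only so that the sketch runs without model weights; in the actual framework these roles would be filled by the Textual-to-Visual Cue Converter and SAM.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical types for illustration; none of these names come from the paper.
Image = List[List[float]]          # H x W grayscale image
Mask = List[List[int]]             # H x W binary mask
Box = Tuple[int, int, int, int]    # (x0, y0, x1, y1), upper bounds exclusive


@dataclass
class PseudoLabeledSample:
    image: Image
    text: str
    mask: Mask


def make_pseudo_labels(
    images: Sequence[Image],
    text: str,
    detect: Callable[[Image, str], Optional[Box]],
    segment: Callable[[Image, Box], Mask],
) -> List[PseudoLabeledSample]:
    """Turn one text cue into box prompts, then into pseudo-masks."""
    samples = []
    for image in images:
        box = detect(image, text)       # text cue -> visual prompt
        if box is None:                 # assumption: skip images with no detection
            continue
        mask = segment(image, box)      # visual prompt -> pseudo-mask
        samples.append(PseudoLabeledSample(image, text, mask))
    return samples


# Toy stand-ins so the sketch runs without any model weights.
def toy_detect(image: Image, text: str) -> Optional[Box]:
    """Bounding box of all pixels brighter than 0.5 (the text is ignored)."""
    hits = [(x, y) for y, row in enumerate(image) for x, v in enumerate(row) if v > 0.5]
    if not hits:
        return None
    xs, ys = zip(*hits)
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)


def toy_segment(image: Image, box: Box) -> Mask:
    """Mark bright pixels inside the box as foreground."""
    x0, y0, x1, y1 = box
    return [
        [int(x0 <= x < x1 and y0 <= y < y1 and v > 0.5) for x, v in enumerate(row)]
        for y, row in enumerate(image)
    ]


images = [
    [[0.0, 0.9, 0.0], [0.0, 0.8, 0.0], [0.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],  # nothing to detect
]
dataset = make_pseudo_labels(images, "a polyp", toy_detect, toy_segment)
```

Passing the detector and segmenter in as callables keeps the pseudo-labeling loop independent of any particular SAM variant, which matters here because the TL;DR notes that the choice of SAM variant influences results.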

Paper Structure

This paper contains 12 sections, 7 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: The framework of SimTxtSeg. The textual-to-visual cue converter enables SAM to generate pseudo masks via text cues. Then, the weakly-supervised segmentation model is enhanced by text-vision hybrid attention.
  • Figure 2: Detailed structure of our text-vision hybrid attention decoder layer, containing the essential dual-way cross-modal attention and channel attention.
  • Figure 3: Qualitative visualization on polyp and brain tumor segmentation.
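
The Figure 2 caption names two ingredients of the decoder layer: dual-way cross-modal attention and channel attention. The sketch below shows one plausible reading of those terms in dependency-free Python. It is a toy under loud assumptions, not the paper's layer: learned projections, multi-head splitting, and normalization are omitted; the two attention directions are computed in parallel from the unfused inputs; and the channel attention is a parameter-free squeeze-and-excitation-style gate chosen here for brevity. The function names (`cross_attend`, `channel_gate`, `hybrid_layer`) are hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]  # rows are tokens, columns are feature channels


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x: float) -> float:
    """Sigmoid written to avoid overflow for large negative inputs."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def cross_attend(queries: Matrix, context: Matrix) -> Matrix:
    """Scaled dot-product attention plus a residual connection.

    Keys and values are both `context`; a real layer would apply learned
    projections first (omitted here as a simplification).
    """
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in context]
        weights = softmax(scores)
        attended = [sum(w * k[c] for w, k in zip(weights, context)) for c in range(len(q))]
        out.append([a + b for a, b in zip(q, attended)])
    return out


def channel_gate(tokens: Matrix) -> Matrix:
    """Rescale each channel by the sigmoid of its mean over all tokens."""
    n = len(tokens)
    means = [sum(col) / n for col in zip(*tokens)]
    gates = [sigmoid(m) for m in means]
    return [[v * g for v, g in zip(row, gates)] for row in tokens]


def hybrid_layer(image_tokens: Matrix, text_tokens: Matrix) -> Tuple[Matrix, Matrix]:
    """Dual-way fusion: each modality attends to the other, then gate image channels."""
    image_out = cross_attend(image_tokens, text_tokens)  # image attends to text
    text_out = cross_attend(text_tokens, image_tokens)   # text attends to image
    return channel_gate(image_out), text_out


# Three image tokens and two text tokens, each with two channels (toy values).
image_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_tokens = [[1.0, 0.0], [0.5, 0.5]]
fused_image, fused_text = hybrid_layer(image_tokens, text_tokens)
```

Both outputs keep the shape of their inputs, so a layer of this form can be stacked; whether the actual decoder gates the image branch, the text branch, or both is not stated in the caption and is an assumption here.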