SimTxtSeg: Weakly-Supervised Medical Image Segmentation with Simple Text Cues
Yuxin Xie, Tao Zhou, Yi Zhou, Geng Chen
TL;DR
Medical image segmentation typically requires costly pixel-level annotations. This paper proposes SimTxtSeg, a weakly-supervised framework with two components: a Textual-to-Visual Cue Converter (TVCC), which translates simple text prompts into visual prompts and, via the Segment Anything Model (SAM), into pseudo-masks, and a Text-Vision Hybrid Attention (TVHA) module, which fuses text and image features during segmentation. On colonic polyp and MRI brain tumor datasets, pseudo-label quality approaches fully-supervised performance, and SimTxtSeg-w-TVHA achieves state-of-the-art results among weakly-supervised methods; the use of sentence-level prompts and the choice of SAM variant both influence outcomes. Overall, the method reduces the annotation burden in medical imaging and demonstrates effective language-guided, cross-modal segmentation, with potential for broader clinical impact and end-to-end integration with SAM.
Abstract
Weakly-supervised medical image segmentation is a challenging task that aims to reduce annotation cost while maintaining segmentation performance. In this paper, we present a novel framework, SimTxtSeg, that leverages simple text cues to generate high-quality pseudo-labels and, at the same time, studies cross-modal fusion in training segmentation models. Our contribution consists of two key components: an effective Textual-to-Visual Cue Converter that produces visual prompts from text prompts on medical images, and a text-guided segmentation model with Text-Vision Hybrid Attention that fuses text and image features. We evaluate our framework on two medical image segmentation tasks, colonic polyp segmentation and MRI brain tumor segmentation, and achieve consistent state-of-the-art performance. Source code is available at: https://github.com/xyx1024/SimTxtSeg.
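The abstract does not describe the internals of Text-Vision Hybrid Attention, so the following is only a generic illustration of how text and image features can be fused: a single-head cross-attention step in which image tokens attend to text tokens. The function names, tensor shapes, random projection matrices, and the residual connection are all assumptions made for this sketch and are not taken from the paper or its repository.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(image_tokens, text_tokens, w_q, w_k, w_v):
    """Generic single-head cross-attention (NOT the paper's TVHA module).

    image_tokens: (N, d_img) flattened image features, used as queries.
    text_tokens:  (T, d_txt) text-encoder features, used as keys and values.
    w_q: (d_img, d_img), w_k and w_v: (d_txt, d_img) projection matrices.
    Returns the fused image tokens (N, d_img) and attention weights (N, T).
    """
    q = image_tokens @ w_q                     # (N, d_img)
    k = text_tokens @ w_k                      # (T, d_img)
    v = text_tokens @ w_v                      # (T, d_img)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (N, T) scaled dot products
    weights = softmax(scores, axis=-1)         # each row sums to 1 over text tokens
    fused = image_tokens + weights @ v         # residual connection (an assumption)
    return fused, weights


# Hypothetical sizes: a 4x4 feature map (16 tokens) and a 5-token text prompt.
rng = np.random.default_rng(0)
n_img, n_txt, d_img, d_txt = 16, 5, 8, 12
image_tokens = rng.normal(size=(n_img, d_img))
text_tokens = rng.normal(size=(n_txt, d_txt))
w_q = rng.normal(size=(d_img, d_img)) * 0.1
w_k = rng.normal(size=(d_txt, d_img)) * 0.1
w_v = rng.normal(size=(d_txt, d_img)) * 0.1

fused, weights = cross_attention(image_tokens, text_tokens, w_q, w_k, w_v)
```

In this sketch each image location receives a weighted mixture of the text features, so the prompt can modulate which regions the downstream decoder emphasizes. A trained model would learn the projection matrices rather than sample them randomly.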
