STPNet: Scale-aware Text Prompt Network for Medical Image Segmentation
Dandan Shan, Zihan Li, Yunxiang Li, Qingde Li, Jie Tian, Qingqi Hong
TL;DR
STPNet tackles lesion segmentation under scale and location variability by leveraging a vision-language framework that retrieves category-specific medical text during training and propagates its semantics to a segmentation network without requiring text input at inference. The method combines a Text Retrieval Network with a Segmentation Network that uses MTBlock and UTrans to fuse multi-scale textual cues with image features, aided by the Spatial Scale-aware Module. On COVID-CT, COVID-Xray, and Kvasir-SEG it achieves state-of-the-art Dice and IoU, notably excelling in polyp segmentation and pneumonia imaging. While effective, performance depends on the quality of the medical text repository and incurs extra computation, suggesting future work on uncertainty estimation, adaptive text corpora, and efficiency.
Abstract
Accurate segmentation of lesions plays a critical role in medical image analysis and diagnosis. Traditional segmentation approaches that rely solely on visual features often struggle with the inherent uncertainty in lesion distribution and size. To address these issues, we propose STPNet, a Scale-aware Text Prompt Network that leverages vision-language modeling to enhance medical image segmentation. Our approach utilizes multi-scale textual descriptions to guide lesion localization and employs retrieval-segmentation joint learning to bridge the semantic gap between visual and linguistic modalities. Crucially, STPNet retrieves relevant textual information from a specialized medical text repository during training, eliminating the need for text input during inference while retaining the benefits of cross-modal learning. We evaluate STPNet on three datasets: COVID-Xray, COVID-CT, and Kvasir-SEG. Experimental results show that our vision-language approach outperforms state-of-the-art segmentation methods, demonstrating the effectiveness of incorporating textual semantic knowledge into medical image analysis. The code has been made publicly available at https://github.com/HUANGLIZI/STPNet.
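The results summarized above are reported as Dice and IoU scores. As a reminder of what these overlap metrics measure, the following is a minimal sketch of how they are commonly computed for a pair of binary masks. The function name `dice_and_iou`, the nested-list mask format, and the convention of returning 1.0 when both masks are empty are assumptions made for illustration; they are not taken from the STPNet code.

```python
def dice_and_iou(pred, target):
    """Return (dice, iou) for two equally sized binary masks.

    Masks are nested lists of 0/1 values (an assumed format for this sketch).
    """
    # Flatten both masks and normalize every entry to 0 or 1.
    flat_pred = [int(bool(v)) for row in pred for v in row]
    flat_target = [int(bool(v)) for row in target for v in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("pred and target must have the same number of pixels")

    # Count overlapping foreground pixels and the size of each mask.
    intersection = sum(p & t for p, t in zip(flat_pred, flat_target))
    pred_area = sum(flat_pred)
    target_area = sum(flat_target)
    union = pred_area + target_area - intersection

    # Assumed convention: two empty masks count as a perfect match.
    if union == 0:
        return 1.0, 1.0

    dice = 2 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou
```

For example, a prediction `[[1, 1], [0, 0]]` against a ground truth `[[1, 0], [0, 0]]` shares one foreground pixel, giving a Dice of 2/3 and an IoU of 1/2.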
