STPNet: Scale-aware Text Prompt Network for Medical Image Segmentation

Dandan Shan, Zihan Li, Yunxiang Li, Qingde Li, Jie Tian, Qingqi Hong

TL;DR

STPNet tackles lesion segmentation under scale and location variability by leveraging a vision-language framework that retrieves category-specific medical text during training and propagates its semantics to a segmentation network without requiring text input at inference. The method combines a Text Retrieval Network with a Segmentation Network that uses MTBlock and UTrans to fuse multi-scale textual cues with image features, aided by the Spatial Scale-aware Module. On COVID-CT, COVID-Xray, and Kvasir-SEG it achieves state-of-the-art Dice and IoU, notably excelling in polyp segmentation and pneumonia imaging. While effective, performance depends on the quality of the medical text repository and incurs extra computation, suggesting future work on uncertainty estimation, adaptive text corpora, and efficiency.
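For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how Dice and IoU are commonly computed for a pair of binary masks. The function name `dice_and_iou`, the flat-list mask representation, and the smoothing constant `eps` are illustrative assumptions; this is not the authors' evaluation code, which may differ in thresholding and averaging.

```python
def dice_and_iou(pred, target, eps=1e-7):
    """Compute Dice and IoU for two binary masks given as flat 0/1 sequences.

    `eps` is an assumed smoothing term that avoids division by zero
    when both masks are empty.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    # Pixels labeled as lesion in both the prediction and the ground truth.
    intersection = sum(p * t for p, t in zip(pred, target))
    pred_sum = sum(pred)
    target_sum = sum(target)
    union = pred_sum + target_sum - intersection
    dice = (2 * intersection + eps) / (pred_sum + target_sum + eps)
    iou = (intersection + eps) / (union + eps)
    return dice, iou


# Toy 4-pixel example: one pixel overlaps, each mask has two positives.
pred = [1, 1, 0, 0]
target = [1, 0, 1, 0]
dice, iou = dice_and_iou(pred, target)
print(f"Dice={dice:.4f}, IoU={iou:.4f}")
```

In this toy case the masks share one of three labeled pixels, so IoU is about 1/3 and Dice is about 1/2. Dice is never lower than IoU for the same pair of masks.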

Abstract

Accurate segmentation of lesions plays a critical role in medical image analysis and diagnosis. Traditional segmentation approaches that rely solely on visual features often struggle with the inherent uncertainty in lesion distribution and size. To address these issues, we propose STPNet, a Scale-aware Text Prompt Network that leverages vision-language modeling to enhance medical image segmentation. Our approach utilizes multi-scale textual descriptions to guide lesion localization and employs retrieval-segmentation joint learning to bridge the semantic gap between visual and linguistic modalities. Crucially, STPNet retrieves relevant textual information from a specialized medical text repository during training, eliminating the need for text input during inference while retaining the benefits of cross-modal learning. We evaluate STPNet on three datasets: COVID-Xray, COVID-CT, and Kvasir-SEG. Experimental results show that our vision-language approach outperforms state-of-the-art segmentation methods, demonstrating the effectiveness of incorporating textual semantic knowledge into medical image analysis. The code is publicly available at https://github.com/HUANGLIZI/STPNet.

Paper Structure

This paper contains 17 sections, 11 equations, 7 figures, 11 tables.

Figures (7)

  • Figure 1: Distribution of COVID-19. Lesions may be distributed in different locations with different sizes. The corresponding text labels for the images can be categorized into four types: Infection text, Num text, Left Loc text and Right Loc text. STPNet considers different locations and different scales jointly.
  • Figure 2: Overview of our proposed STPNet. The framework consists of a Text Retrieval Network and a Segmentation Network. The Text Retrieval Network retrieves relevant text features by computing cosine similarity with image features. These retrieval text features are then recombined and fed into the Segmentation Network. The Segmentation Network, a hybrid CNN-Transformer architecture, employs MTBlock and UTrans for local and global encoding of text and image features. We utilize the Spatial Scale-aware Modules (SSM) to learn multi-scale and spatial features. Finally, the mixed-learning objectives optimize both the image segmentation and retrieval processes.
  • Figure 3: Illustration of MTBlock and UTrans Encoder. (a) In the MTBlock integration process, the average text features $\bar{F}_{text,i}$ are concatenated with image features $F_{text,i}$ after feature map formation. This combined representation undergoes convolution to produce fused visual-textual features $F_{mix,i}$. (b) In the UTrans Encoder, image features $F_{mix,i}$ processed by SSM are merged with text features $F_{text,i+1}$ using a multi-head self-attention mechanism. Only the image-related features are then extracted and passed to the subsequent transformer.
  • Figure 4: Structure of the Spatial Scale-aware Module (SSM), which learns multi-scale features through convolutions with different dilation rates and integrates spatial features using 1x1 convolutions.
  • Figure 5: Performance comparison between our proposed method (STPNet) and other state-of-the-art methods on the Kvasir-SEG dataset using the Precision and Recall metrics.
  • ...and 2 more figures
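
The Figure 2 caption states that the Text Retrieval Network selects text features by computing cosine similarity with image features. The sketch below illustrates that retrieval step in plain Python under simplifying assumptions: the three-dimensional embeddings, the repository entries, and the function names `cosine_similarity` and `retrieve_text` are all hypothetical, and a single text category is shown. It is not taken from the STPNet repository.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat a zero vector as having no similarity.
    return dot / (norm_a * norm_b)


def retrieve_text(image_feature, text_repository):
    """Return (key, score) of the repository entry most similar to the image."""
    best_key, best_score = None, -math.inf
    for key, text_feature in text_repository.items():
        score = cosine_similarity(image_feature, text_feature)
        if score > best_score:
            best_key, best_score = key, score
    return best_key, best_score


# Hypothetical toy repository for one text category (made-up embeddings).
repository = {
    "one infected area": [1.0, 0.0, 0.0],
    "two infected areas": [0.0, 1.0, 0.0],
    "three infected areas": [0.0, 0.0, 1.0],
}
image_feature = [0.1, 0.9, 0.2]  # Assumed output of an image encoder.

key, score = retrieve_text(image_feature, repository)
print(key, round(score, 4))
```

In a trained model the embeddings would come from learned image and text encoders, and the retrieved features, rather than the text strings, would be passed on to the segmentation network.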