Text-guided multi-stage cross-perception network for medical image segmentation
Gaoyu Chen, Haixia Pan
TL;DR
The paper tackles the challenge of accurate medical image segmentation by integrating textual prompts to guide segmentation. It introduces the Text-guided Multi-stage Cross-perception network (TMC) with a Multi-stage Cross-attention Module (MCM) and a Multi-stage Alignment Loss (MA Loss) to foster multi-scale, cross-modal interaction and alignment. Across three diverse datasets (QaTa-COV19, MosMedData, and Duke-Breast-Cancer-MRI), TMC achieves state-of-the-art Dice scores and mIoU, with ablation studies confirming the complementary benefits of MCM and MA Loss. The work advances clinical applicability by enabling language-driven segmentation with interpretable cross-attention maps, paving the way for interactive, text-guided diagnostic tools while noting limitations and avenues for future work.
Abstract
Medical image segmentation plays a crucial role in clinical medicine, serving as a key tool for auxiliary diagnosis, treatment planning, and disease monitoring. However, traditional segmentation methods such as U-Net are often limited by weak semantic expression of target regions, which stems from insufficient generalization and a lack of interactivity. Incorporating text prompts offers a promising avenue to more accurately pinpoint lesion locations, yet existing text-guided methods are still hindered by insufficient cross-modal interaction and inadequate cross-modal feature representation. To address these challenges, we propose the Text-guided Multi-stage Cross-perception network (TMC). TMC incorporates a Multi-stage Cross-attention Module (MCM) to enhance the model's understanding of fine-grained semantic details and a Multi-stage Alignment Loss (MA Loss) to improve the consistency of cross-modal semantics across different feature levels. Experimental results on three public datasets (QaTa-COV19, MosMedData, and Duke-Breast-Cancer-MRI) demonstrate the superior performance of TMC, achieving Dice scores of 84.65%, 78.39%, and 88.09%, respectively, and consistently outperforming both U-Net-based networks and existing text-guided methods.
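For readers unfamiliar with the reported metrics, the following is a minimal sketch of how a Dice score and an IoU value can be computed for a single pair of binary masks. It is an illustration only, not the authors' evaluation code: the function name `dice_and_iou`, the flat 0/1 mask representation, and the convention of returning 1.0 when both masks are empty are all assumptions made here. The averaging convention behind the paper's mIoU (over classes, over images, or both) is not stated in this section, so the sketch stops at a per-mask IoU.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two binary masks.

    Both masks are flat sequences of 0/1 values of equal length
    (a hypothetical representation chosen for this sketch).
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")

    # Pixels marked as foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection

    # Assumed convention: two empty masks count as perfect agreement.
    if union == 0:
        return 1.0, 1.0

    dice = 2 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


# Toy 2x4 masks, flattened row by row.
pred = [0, 1, 1, 1,
        0, 1, 1, 0]
target = [0, 1, 1, 0,
          0, 1, 1, 1]

dice, iou = dice_and_iou(pred, target)
print(f"Dice = {dice:.4f}, IoU = {iou:.4f}")  # Dice = 0.8000, IoU = 0.6667
```

In this toy case the masks share 4 foreground pixels, each mask has 5, and the union has 6, so Dice is 2·4/(5+5) = 0.8 and IoU is 4/6. For a single mask pair the two metrics are related by Dice = 2·IoU/(1 + IoU), so Dice is never lower than IoU.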
