Text-guided multi-stage cross-perception network for medical image segmentation

Gaoyu Chen; Haixia Pan

TL;DR

The paper tackles the challenge of accurate medical image segmentation by integrating textual prompts to guide segmentation. It introduces the Text-guided Multi-stage Cross-perception network (TMC) with a Multi-stage Cross-attention Module (MCM) and a Multi-stage Alignment Loss (MA Loss) to foster multi-scale, cross-modal interaction and alignment. Across three diverse datasets (QaTa-COV19, MosMedData, and Duke-Breast-Cancer-MRI), TMC achieves state-of-the-art Dice scores and mIoU, with ablation studies confirming the complementary benefits of MCM and MA Loss. The work advances clinical applicability by enabling language-driven segmentation with interpretable cross-attention maps, paving the way for interactive, text-guided diagnostic tools while noting limitations and avenues for future work.
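To make the idea of a cross-attention map concrete, the following is a minimal sketch of generic scaled dot-product cross-attention in which image tokens act as queries and text tokens act as keys and values. It is not the authors' MCM: the function names, the toy inputs, and the absence of learned projections and multiple stages are all simplifying assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(image_tokens, text_tokens):
    """Generic scaled dot-product cross-attention (no learned projections).

    image_tokens: list of d-dimensional vectors used as queries.
    text_tokens:  list of d-dimensional vectors used as keys and values.
    Returns (attended, attn_map), where attn_map[i][j] is the weight that
    image token i places on text token j.
    """
    dim = len(text_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    attended, attn_map = [], []
    for query in image_tokens:
        # Similarity between this image token and every text token.
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in text_tokens]
        weights = softmax(scores)
        attn_map.append(weights)
        # Weighted sum of the text vectors, one output per image token.
        attended.append([
            sum(w * value[c] for w, value in zip(weights, text_tokens))
            for c in range(dim)
        ])
    return attended, attn_map


# Hypothetical toy inputs: two image-patch features and three word features.
image_tokens = [[1.0, 0.0], [0.0, 1.0]]
text_tokens = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
attended, attn_map = cross_attention(image_tokens, text_tokens)
```

Each row of `attn_map` sums to 1 and can be read as a distribution over words for one image location, which is the sense in which such maps are inspectable. In this toy case the first image token puts most of its weight on the first text token, and the second on the second.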

Abstract

Medical image segmentation plays a crucial role in clinical medicine, serving as a key tool for auxiliary diagnosis, treatment planning, and disease monitoring. However, traditional segmentation methods such as U-Net are often limited by weak semantic expression of target regions, which stems from insufficient generalization and a lack of interactivity. Incorporating text prompts offers a promising avenue to more accurately pinpoint lesion locations, yet existing text-guided methods are still hindered by insufficient cross-modal interaction and inadequate cross-modal feature representation. To address these challenges, we propose the Text-guided Multi-stage Cross-perception network (TMC). TMC incorporates a Multi-stage Cross-attention Module (MCM) to enhance the model's understanding of fine-grained semantic details and a Multi-stage Alignment Loss (MA Loss) to improve the consistency of cross-modal semantics across different feature levels. Experimental results on three public datasets (QaTa-COV19, MosMedData, and Duke-Breast-Cancer-MRI) demonstrate the superior performance of TMC, achieving Dice scores of 84.65%, 78.39%, and 88.09%, respectively, and consistently outperforming both U-Net-based networks and existing text-guided methods.
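For readers unfamiliar with the reported metric, the sketch below computes the Dice score and the intersection-over-union (IoU) for a pair of binary masks using their standard definitions. The function name, the toy masks, and the convention for two empty masks are assumptions made for illustration; how the paper averages IoU into mIoU is not stated in this section.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two equally sized binary masks.

    Both masks are nested lists of 0/1 values (rows of pixels).
    Dice = 2 * |P and T| / (|P| + |T|),  IoU = |P and T| / |P or T|.
    """
    pred_flat = [p for row in pred for p in row]
    target_flat = [t for row in target for t in row]
    if len(pred_flat) != len(target_flat):
        raise ValueError("masks must have the same number of pixels")

    intersection = sum(1 for p, t in zip(pred_flat, target_flat) if p and t)
    pred_area = sum(1 for p in pred_flat if p)
    target_area = sum(1 for t in target_flat if t)
    union = pred_area + target_area - intersection

    if union == 0:
        # Both masks are empty: treated as a perfect match (assumed convention).
        return 1.0, 1.0

    dice = 2.0 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


# Hypothetical 2x3 masks: the prediction and ground truth share two pixels.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
dice, iou = dice_and_iou(pred, target)
```

For a single pair of masks the two metrics are tied by Dice = 2 * IoU / (1 + IoU), so Dice is never lower than IoU on the same prediction.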
