Text-guided multi-stage cross-perception network for medical image segmentation

Gaoyu Chen, Haixia Pan

TL;DR

The paper addresses accurate medical image segmentation by using text prompts to guide the model. It introduces the Text-guided Multi-stage Cross-perception network (TMC), which combines a Multi-stage Cross-attention Module (MCM) with a Multi-stage Alignment Loss (MA Loss) to promote multi-scale cross-modal interaction and alignment. On three datasets (QaTa-COV19, MosMedData, and Duke-Breast-Cancer-MRI), TMC achieves state-of-the-art Dice scores and mIoU, and ablation studies confirm that MCM and MA Loss are complementary. The work supports clinical use by enabling language-driven segmentation with interpretable cross-attention maps, a step toward interactive, text-guided diagnostic tools; the paper also notes limitations and directions for future work.

Abstract

Medical image segmentation plays a crucial role in clinical medicine, serving as a key tool for auxiliary diagnosis, treatment planning, and disease monitoring. However, traditional segmentation methods such as U-Net are often limited by weak semantic expression of target regions, which stems from insufficient generalization and a lack of interactivity. Incorporating text prompts offers a promising avenue to more accurately pinpoint lesion locations, yet existing text-guided methods are still hindered by insufficient cross-modal interaction and inadequate cross-modal feature representation. To address these challenges, we propose the Text-guided Multi-stage Cross-perception network (TMC). TMC incorporates a Multi-stage Cross-attention Module (MCM) to enhance the model's understanding of fine-grained semantic details and a Multi-stage Alignment Loss (MA Loss) to improve the consistency of cross-modal semantics across different feature levels. Experimental results on three public datasets (QaTa-COV19, MosMedData, and Duke-Breast-Cancer-MRI) demonstrate the superior performance of TMC, achieving Dice scores of 84.65%, 78.39%, and 88.09%, respectively, and consistently outperforming both U-Net-based networks and existing text-guided methods.
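
For readers unfamiliar with the reported metrics, the following is a minimal sketch of how a Dice score and an IoU value are commonly computed for a single pair of binary masks. The function names `dice_score` and `iou_score`, the smoothing constant `eps`, and the toy masks are illustrative assumptions; the paper's exact evaluation protocol (thresholding, and how mIoU is averaged over classes or images) is not described in this summary.

```python
import numpy as np


def dice_score(pred, target, eps=1e-7):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    # eps (an assumed smoothing term) avoids division by zero on empty masks.
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)


def iou_score(pred, target, eps=1e-7):
    """IoU = |P ∩ T| / |P ∪ T| for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return (intersection + eps) / (union + eps)


# Toy 2x3 masks (hypothetical data): 2 overlapping pixels, 3 positives each.
pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
target = np.array([[1, 0, 0],
                   [0, 1, 1]])

print(f"Dice: {dice_score(pred, target):.4f}")
print(f"IoU:  {iou_score(pred, target):.4f}")
```

Dice weights the overlap twice relative to the total mask sizes, so for the same prediction it is never lower than IoU; this is why the two numbers are usually reported side by side.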

Paper Structure

This paper contains 23 sections, 15 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: Overview of the proposed Text-guided Multi-stage Cross-perception network (TMC). The image is encoded by a Swin-based visual encoder, while the textual description is processed by a BERT-based language encoder. At selected stages, the Multi-stage Cross-attention Module (MCM) performs bidirectional cross-attention between visual features $V_i$ and language features $L_i$, producing cross-modally enhanced features $F_V^{i}$. The Multi-stage Alignment Loss $\mathcal{L}_{\mathrm{align}}^{i}$ is applied before each cross-attention block to align stage-wise visual and textual embeddings. The fused multi-scale features are propagated through a U-shaped CNN decoder to generate the final segmentation mask, supervised by the segmentation loss $\mathcal{L}_{\mathrm{seg}}$.
  • Figure 2: Qualitative comparison of segmentation results on QaTa-COV19, MosMedData, and Duke-Breast-Cancer-MRI. The proposed TMC produces more accurate and sharper lesion delineations than state-of-the-art baseline methods.
  • Figure 3: Heatmap visualization demonstrating the strong capability of TMC in capturing lesion regions.
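
The Figure 1 caption describes one stage of the pipeline: an alignment loss computed on the stage-wise visual and textual embeddings, followed by bidirectional cross-attention between $V_i$ and $L_i$. The sketch below illustrates that ordering in NumPy under several assumptions that are not taken from the paper: single-head scaled dot-product attention, residual connections, visual and language tokens already projected to a shared width `d`, and a cosine-distance alignment loss on mean-pooled tokens standing in for the actual MA Loss, whose formula is not given in this summary. All function and parameter names (`cross_attention`, `alignment_loss`, `mcm_stage`, `params`) are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head attention: `queries` tokens attend over `context` tokens."""
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def alignment_loss(visual, language):
    """Assumed stand-in for MA Loss: 1 - cosine similarity of pooled tokens."""
    v = visual.mean(axis=0)
    l = language.mean(axis=0)
    cos = v @ l / (np.linalg.norm(v) * np.linalg.norm(l) + 1e-8)
    return 1.0 - cos


def mcm_stage(visual, language, params):
    """One stage: align first, then exchange information in both directions."""
    loss = alignment_loss(visual, language)
    # Visual tokens query the text; residual connection is an assumption.
    f_v = visual + cross_attention(visual, language, *params["v_from_l"])
    # Text tokens query the image features.
    f_l = language + cross_attention(language, visual, *params["l_from_v"])
    return f_v, f_l, loss


rng = np.random.default_rng(0)
d = 16  # assumed shared embedding width
params = {
    name: [rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3)]
    for name in ("v_from_l", "l_from_v")
}
visual = rng.normal(size=(49, d))    # e.g. a 7x7 feature map, flattened
language = rng.normal(size=(12, d))  # e.g. 12 text tokens

f_v, f_l, loss = mcm_stage(visual, language, params)
print(f_v.shape, f_l.shape, float(loss))
```

In a multi-stage setup, the per-stage losses would typically be summed (possibly with weights) and added to the segmentation loss $\mathcal{L}_{\mathrm{seg}}$; the weighting used in the paper is not stated here.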