ATM-Net: Anatomy-Aware Text-Guided Multi-Modal Fusion for Fine-Grained Lumbar Spine Segmentation

Sheng Lian, Dengfeng Pan, Jianlong Cai, Guang-Yong Chen, Zhun Zhong, Zhiming Luo, Shen Zhao, Shuo Li

TL;DR

This work tackles fine-grained lumbar spine MRI segmentation, where purely visual models struggle to capture anatomical semantics. It introduces ATM-Net, a framework that fuses anatomy-aware text prompts with image features through three modules: the Anatomy-aware Text Prompt Generator (ATPG), Holistic Anatomy-aware Semantic Fusion (HASF), and Channel-wise Contrastive Anatomy-Aware Enhancement (CCAE), built on a Swin UNETR backbone and Bio-ClinicalBERT. The training objective combines a Dice-Focal segmentation loss with a multi-modal contrastive loss, $L_{total}=L_{DiceFocal}+L_{ftc}$, to optimize both segmentation and cross-modal alignment. Experiments on MRSpineSeg and SPIDER show state-of-the-art performance, e.g., a Dice of 79.39% and an HD95 of 9.91 pixels on SPIDER, highlighting the practical potential of incorporating anatomy-informed textual knowledge into medical image segmentation.
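To make the additive structure of the objective concrete, the following is a minimal sketch for a single binary class over flattened voxels. It assumes that $L_{DiceFocal}$ is an unweighted sum of a soft Dice loss and a binary focal loss, which is one common formulation and not necessarily the authors' exact one. The contrastive term $L_{ftc}$ is not defined in this summary, so it is passed in as a precomputed number. The function names, `gamma`, and `eps` are illustrative choices, not taken from the paper.

```python
import math

def dice_loss(probs, targets, eps=1e-6):
    """Soft Dice loss for one class; probs in [0, 1], targets in {0, 1}."""
    inter = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)

def focal_loss(probs, targets, gamma=2.0, eps=1e-7):
    """Binary focal loss averaged over voxels (gamma=2.0 is an assumed default)."""
    total = 0.0
    for p, t in zip(probs, targets):
        # Probability assigned to the true label, clamped to avoid log(0).
        pt = p if t == 1 else 1.0 - p
        pt = min(max(pt, eps), 1.0)
        total += -((1.0 - pt) ** gamma) * math.log(pt)
    return total / len(probs)

def total_loss(probs, targets, contrastive_term):
    """L_total = L_DiceFocal + L_ftc, with L_ftc supplied by the caller (assumption)."""
    dice_focal = dice_loss(probs, targets) + focal_loss(probs, targets)
    return dice_focal + contrastive_term

# Hypothetical toy inputs: four voxels, two of them foreground.
targets = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]
uncertain = [0.5, 0.5, 0.5, 0.5]
print(total_loss(confident, targets, contrastive_term=0.3))
print(total_loss(uncertain, targets, contrastive_term=0.3))
```

Because the two terms are simply added, the contrastive term shifts the total without changing how the segmentation term ranks predictions; any relative weighting between the terms would be an additional design choice not stated here.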

Abstract

Accurate lumbar spine segmentation is crucial for diagnosing spinal disorders. Existing methods typically use coarse-grained segmentation strategies that lack the fine detail needed for precise diagnosis. Additionally, their reliance on visual-only models hinders the capture of anatomical semantics, leading to misclassified categories and poor segmentation details. To address these limitations, we present ATM-Net, an innovative framework that employs an anatomy-aware, text-guided, multi-modal fusion mechanism for fine-grained segmentation of lumbar substructures, i.e., vertebrae (VBs), intervertebral discs (IDs), and spinal canal (SC). ATM-Net adopts the Anatomy-aware Text Prompt Generator (ATPG) to adaptively convert image annotations into anatomy-aware prompts in different views. These insights are further integrated with image features via the Holistic Anatomy-aware Semantic Fusion (HASF) module, building a comprehensive anatomical context. The Channel-wise Contrastive Anatomy-Aware Enhancement (CCAE) module further enhances class discrimination and refines segmentation through class-wise channel-level multi-modal contrastive learning. Extensive experiments on the MRSpineSeg and SPIDER datasets demonstrate that ATM-Net significantly outperforms state-of-the-art methods, with consistent improvements regarding class discrimination and segmentation details. For example, ATM-Net achieves Dice of 79.39% and HD95 of 9.91 pixels on SPIDER, outperforming the competitive SpineParseNet by 8.31% and 4.14 pixels, respectively.
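The two reported metrics can be illustrated with a small, dependency-free sketch. It assumes binary masks given as flat 0/1 lists for Dice, and boundary pixels given as coordinate tuples for HD95. HD95 has several variants in practice; this sketch takes the nearest-rank 95th percentile of each directed boundary distance and returns the larger of the two, which may differ from the evaluation code used in the paper. All function names are hypothetical.

```python
import math

def dice_score(pred, gt):
    """Dice coefficient between two binary masks given as flat 0/1 lists."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    size = sum(1 for p in pred if p) + sum(1 for g in gt if g)
    return 1.0 if size == 0 else 2.0 * inter / size

def _directed_distances(src, dst):
    """Distance from every point in src to its nearest point in dst."""
    return [min(math.dist(s, d) for d in dst) for s in src]

def _percentile(values, q):
    """Nearest-rank percentile (one of several common definitions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]

def hd95(points_a, points_b):
    """95th-percentile Hausdorff distance between two boundary point sets, in pixels."""
    d_ab = _directed_distances(points_a, points_b)
    d_ba = _directed_distances(points_b, points_a)
    return max(_percentile(d_ab, 95), _percentile(d_ba, 95))

# Hypothetical toy inputs, not data from the paper.
pred_mask = [1, 1, 0, 0]
gt_mask = [1, 0, 0, 0]
print(dice_score(pred_mask, gt_mask))

pred_boundary = [(0, 0), (0, 1)]
gt_boundary = [(0, 0), (0, 3)]
print(hd95(pred_boundary, gt_boundary))
```

Dice rewards region overlap (higher is better), while HD95 penalizes boundary outliers (lower is better), which is why the abstract reports gains in both directions.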

Paper Structure

This paper contains 19 sections, 8 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: (a) Task definition on the fine-grained segmentation of lumbar spine MRI. (b) Task challenges in various aspects. (c) The design comparison between the visual-only models, the existing VLMs, and our ATM-Net. (d) Our ATM-Net’s motivation in qualitative view.
  • Figure 2: Method overview. ATPG adaptively converts image annotation into anatomy-aware text prompts. These insights are integrated with visual features via HASF, building a comprehensive anatomical context. CCAE further enhances class discrimination and segmentation details through class-wise channel-level multi-modal contrastive learning. Best viewed in color.
  • Figure 2: Quantitative comparisons on overall performance. We include both established MIS models and specialized models for comparison. The best results are highlighted in bold.
  • Figure 3: The process of text prompt generation in ATPG.
  • Figure 4: The t-SNE visualization of embedding space on both datasets for Swin UNETR and our ATM-Net.
  • ...and 2 more figures