ATM-Net: Anatomy-Aware Text-Guided Multi-Modal Fusion for Fine-Grained Lumbar Spine Segmentation
Sheng Lian, Dengfeng Pan, Jianlong Cai, Guang-Yong Chen, Zhun Zhong, Zhiming Luo, Shen Zhao, Shuo Li
TL;DR
This work tackles fine-grained lumbar spine MRI segmentation, where purely visual models struggle to capture anatomical semantics. It introduces ATM-Net, a framework that fuses anatomy-aware text prompts with image features through three modules: the Anatomy-aware Text Prompt Generator (ATPG), Holistic Anatomy-aware Semantic Fusion (HASF), and Channel-wise Contrastive Anatomy-Aware Enhancement (CCAE). The framework is built on a Swin UNETR backbone and Bio-ClinicalBERT. The training objective sums a Dice-Focal segmentation loss and a multi-modal contrastive loss, $L_{total}=L_{DiceFocal}+L_{ftc}$, to optimize segmentation and cross-modal alignment jointly. Experiments on MRSpineSeg and SPIDER demonstrate state-of-the-art performance, with gains in both Dice and the HD95 boundary metric (79.39% Dice and 9.91 pixels HD95 on SPIDER), highlighting the practical potential of incorporating anatomy-informed textual knowledge into medical image segmentation.
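To make the objective $L_{total}=L_{DiceFocal}+L_{ftc}$ concrete, the following is a minimal pure-Python sketch for a single binary class over flattened voxels. It is not the authors' implementation. The function names, the focal exponent `gamma`, the `temperature`, and in particular the InfoNCE-style form chosen for the contrastive term are assumptions made for illustration; the summary above only states that a Dice-Focal loss and a multi-modal contrastive loss are added together.

```python
import math

def dice_loss(probs, targets, eps=1e-6):
    """Soft Dice loss for one class over flattened voxel probabilities."""
    inter = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)

def focal_loss(probs, targets, gamma=2.0, eps=1e-7):
    """Binary focal loss averaged over voxels (gamma is an assumed default)."""
    total = 0.0
    for p, t in zip(probs, targets):
        pt = p if t == 1 else 1.0 - p      # probability assigned to the true label
        pt = min(max(pt, eps), 1.0)        # clamp to avoid log(0)
        total += -((1.0 - pt) ** gamma) * math.log(pt)
    return total / len(probs)

def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)

def contrastive_loss(img_feats, txt_feats, temperature=0.1):
    """ASSUMPTION: InfoNCE-style term where the i-th image feature
    should match the i-th text feature and mismatch all others."""
    loss = 0.0
    for i, f in enumerate(img_feats):
        logits = [_cosine(f, t) / temperature for t in txt_feats]
        m = max(logits)                    # log-sum-exp with max subtraction
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denom - logits[i]
    return loss / len(img_feats)

def total_loss(probs, targets, img_feats, txt_feats):
    """L_total = L_DiceFocal + L_ftc, with equal weights as in the formula."""
    l_dicefocal = dice_loss(probs, targets) + focal_loss(probs, targets)
    l_ftc = contrastive_loss(img_feats, txt_feats)
    return l_dicefocal + l_ftc
```

With a perfect prediction and well-aligned image and text features, every term in this sketch approaches zero; swapping the text features so that each image feature is paired with the wrong class raises only the contrastive term.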
Abstract
Accurate lumbar spine segmentation is crucial for diagnosing spinal disorders. Existing methods typically use coarse-grained segmentation strategies that lack the fine detail needed for precise diagnosis. Additionally, their reliance on visual-only models hinders the capture of anatomical semantics, leading to misclassified categories and poor segmentation details. To address these limitations, we present ATM-Net, a framework that employs an anatomy-aware, text-guided, multi-modal fusion mechanism for fine-grained segmentation of lumbar substructures, i.e., vertebrae (VBs), intervertebral discs (IDs), and spinal canal (SC). ATM-Net adopts the Anatomy-aware Text Prompt Generator (ATPG) to adaptively convert image annotations into anatomy-aware prompts in different views. These prompts are then integrated with image features via the Holistic Anatomy-aware Semantic Fusion (HASF) module, building a comprehensive anatomical context. The Channel-wise Contrastive Anatomy-Aware Enhancement (CCAE) module further enhances class discrimination and refines segmentation through class-wise, channel-level multi-modal contrastive learning. Extensive experiments on the MRSpineSeg and SPIDER datasets demonstrate that ATM-Net significantly outperforms state-of-the-art methods, with consistent improvements in class discrimination and segmentation details. For example, ATM-Net achieves a Dice of 79.39% and an HD95 of 9.91 pixels on SPIDER, outperforming the competitive SpineParseNet by 8.31% and 4.14 pixels, respectively.
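The two reported metrics, Dice and HD95, can be illustrated with a small pure-Python sketch on 2D binary masks. This is an illustration under stated assumptions, not the evaluation code used in the paper: the helper names are hypothetical, boundaries are taken with 4-connectivity, distances are Euclidean in pixel units, and HD95 is computed as the larger of the two directed 95th-percentile boundary distances, which is one common variant among several in use.

```python
import math

def dice_score(pred, gt):
    """Dice overlap between two binary masks given as lists of rows."""
    inter = sum(p * g for rp, rg in zip(pred, gt) for p, g in zip(rp, rg))
    total = sum(map(sum, pred)) + sum(map(sum, gt))
    return 1.0 if total == 0 else 2.0 * inter / total

def boundary_points(mask):
    """Foreground pixels with a background or out-of-image 4-neighbour."""
    h, w = len(mask), len(mask[0])
    pts = []
    for r in range(h):
        for c in range(w):
            if not mask[r][c]:
                continue
            neighbours = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            if any(not (0 <= y < h and 0 <= x < w) or not mask[y][x]
                   for y, x in neighbours):
                pts.append((r, c))
    return pts

def percentile(values, q):
    """q-th percentile with linear interpolation between sorted samples."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def hd95(pred, gt):
    """95th-percentile Hausdorff distance in pixels (assumed variant:
    maximum of the two directed 95th-percentile boundary distances)."""
    a, b = boundary_points(pred), boundary_points(gt)
    if not a or not b:
        raise ValueError("both masks need at least one foreground pixel")
    d_ab = [min(math.dist(p, q) for q in b) for p in a]
    d_ba = [min(math.dist(q, p) for p in a) for q in b]
    return max(percentile(d_ab, 95), percentile(d_ba, 95))
```

Dice rewards region overlap, while HD95 penalizes boundary points that lie far from the reference contour, which is why the two are reported together: a prediction can overlap well overall and still have a poorly placed boundary.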
