SpinalSAM-R1: A Vision-Language Multimodal Interactive System for Spine CT Segmentation

Jiaming Liu; Dingwei Fan; Junyong Zhao; Chunlin Li; Haipeng Si; Liang Sun

TL;DR

SpinalSAM-R1 addresses spine CT segmentation by pairing a CBAM-augmented, LoRA-fine-tuned Segment Anything Model (SAM) with DeepSeek-R1, which enables natural-language-guided refinement of the segmentation. The system uses a five-layer architecture and a feature-enhanced SAM backbone; it reports state-of-the-art Dice and IoU scores and delivers real-time feedback through a PyQt5 interface with 11 natural-language-driven operations. The key contributions are a parameter-efficient adaptation of SAM to medical images, an anatomy-guided attention mechanism, and a semantics-driven interactive workflow aimed at clinical usability and accuracy. The software is publicly released to support adoption in clinical workflows.
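For readers unfamiliar with the two overlap metrics named above, the following is a minimal sketch of how Dice and IoU are commonly computed for a pair of binary masks. It is illustrative only: the function name `dice_and_iou`, the nested-list mask format, and the convention for two empty masks are assumptions made here, not the paper's evaluation code.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two binary masks given as nested lists of 0/1.

    Hypothetical helper for illustration; not taken from the SpinalSAM-R1 code.
    """
    flat_pred = [int(bool(v)) for row in pred for v in row]
    flat_target = [int(bool(v)) for row in target for v in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("masks must have the same number of pixels")

    # Pixels labelled foreground in both masks.
    intersection = sum(p & t for p, t in zip(flat_pred, flat_target))
    pred_area = sum(flat_pred)
    target_area = sum(flat_target)
    union = pred_area + target_area - intersection

    if union == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0, 1.0

    dice = 2 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


# Toy 2x3 masks: 3 predicted pixels, 3 target pixels, 2 shared.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
dice, iou = dice_and_iou(pred, target)
```

On the toy masks the intersection is 2 pixels and the union is 4, so Dice is 4/6 and IoU is 2/4. Dice is never lower than IoU for the same pair of masks, which is why the two are usually reported together.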

Abstract

Segmenting the spine and adjacent anatomical structures in computed tomography (CT) images is a key step in the diagnosis and treatment of spinal disease. However, CT segmentation is impeded by low contrast and complex vertebral boundaries. Although advanced models such as the Segment Anything Model (SAM) have shown promise in various segmentation tasks, their performance on spinal CT is limited by high annotation requirements and poor domain adaptability. To address these limitations, we propose SpinalSAM-R1, a multimodal vision-language interactive system that integrates a fine-tuned SAM with DeepSeek-R1 for spine CT image segmentation. Specifically, SpinalSAM-R1 introduces an anatomy-guided attention mechanism to improve spine segmentation performance, and a semantics-driven interaction protocol powered by DeepSeek-R1 that enables natural-language-guided refinement. SpinalSAM-R1 is fine-tuned using Low-Rank Adaptation (LoRA) for efficient adaptation. We validate SpinalSAM-R1 on spine anatomical structures in CT images. Experimental results suggest that our method achieves superior segmentation performance. We also develop PyQt5-based interactive software that supports point, box, and text prompts. The system supports 11 clinical operations with 94.3% parsing accuracy and response times under 800 ms. The software is available at https://github.com/6jm233333/spinalsam-r1.
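To make the "efficient adaptation" claim concrete, the sketch below illustrates the general LoRA idea: a frozen weight matrix W is augmented with a trainable low-rank update, so the layer computes W x + (alpha / r) B A x. This is a dependency-free toy, not the authors' implementation; the class name `LoRALinear`, the rank, the scaling factor, and the initialisation are all assumptions chosen for illustration, and the abstract does not state which SAM layers are adapted.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


class LoRALinear:
    """Toy linear layer with a frozen weight and a trainable low-rank update.

    Hypothetical class for illustration; not from the SpinalSAM-R1 code.
    """

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight          # frozen pretrained weight (out x in)
        self.scale = alpha / rank     # standard LoRA scaling factor
        # A starts with small random values, B starts at zero, so the
        # adapted layer initially reproduces the pretrained layer exactly.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)]
                  for _ in range(rank)]                      # rank x in
        self.B = [[0.0] * rank for _ in range(out_dim)]      # out x rank

    def forward(self, x):
        col = [[v] for v in x]                               # column vector
        base = matmul(self.weight, col)                      # W x
        delta = matmul(self.B, matmul(self.A, col))          # B (A x)
        return [b[0] + self.scale * d[0] for b, d in zip(base, delta)]

    def trainable_params(self):
        return (len(self.A) * len(self.A[0])
                + len(self.B) * len(self.B[0]))


# A 4x6 frozen weight adapted with rank 2: 20 trainable values instead of 24.
weight = [[0.1 * (i + j) for j in range(6)] for i in range(4)]
layer = LoRALinear(weight, rank=2, alpha=4)
x = [1.0, 0.0, -1.0, 2.0, 0.5, 0.0]
y = layer.forward(x)
```

In this toy the saving is small, but the trainable count grows as r * (in + out) rather than in * out, so the gap widens quickly for the large projection matrices in a vision transformer such as SAM's image encoder.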
