SpinalSAM-R1: A Vision-Language Multimodal Interactive System for Spine CT Segmentation

Jiaming Liu, Dingwei Fan, Junyong Zhao, Chunlin Li, Haipeng Si, Liang Sun

TL;DR

SpinalSAM-R1 addresses spine CT segmentation by coupling a CBAM-augmented, LoRA-fine-tuned Segment Anything Model (SAM) with DeepSeek-R1 to enable natural language-guided refinement. The system uses a five-layer architecture and a feature-enhanced SAM backbone, and is reported to achieve state-of-the-art Dice and IoU while delivering real-time feedback through a PyQt5 UI and 11 natural-language-driven operations. Key contributions are a parameter-efficient medical adaptation of SAM, an anatomy-guided attention mechanism, and a semantics-driven interactive workflow aimed at clinical usability and accuracy. The software is publicly released to support adoption in clinical workflows.

Abstract

Segmenting the spine and adjacent anatomical structures from computed tomography (CT) images is a key step in the diagnosis and treatment of spinal disease. However, CT segmentation is impeded by low contrast and complex vertebral boundaries. Although advanced models such as the Segment Anything Model (SAM) have shown promise in various segmentation tasks, their performance on spinal CT imaging is limited by high annotation requirements and poor domain adaptability. To address these limitations, we propose SpinalSAM-R1, a multimodal vision-language interactive system that integrates a fine-tuned SAM with DeepSeek-R1 for spine CT image segmentation. Specifically, SpinalSAM-R1 introduces an anatomy-guided attention mechanism to improve spine segmentation performance, and a semantics-driven interaction protocol powered by DeepSeek-R1 that enables natural language-guided refinement. SpinalSAM-R1 is fine-tuned using Low-Rank Adaptation (LoRA) for efficient adaptation. We validate SpinalSAM-R1 on spinal anatomical structures in CT images. Experimental results suggest that our method achieves superior segmentation performance. In addition, we develop PyQt5-based interactive software that supports point, box, and text-based prompts. The system supports 11 clinical operations with 94.3% parsing accuracy and sub-800 ms response times. The software is released at https://github.com/6jm233333/spinalsam-r1.
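To make the LoRA idea mentioned in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes the standard LoRA formulation, in which a frozen weight matrix W is augmented by a trainable low-rank product scaled by alpha / rank. The class name `LoRALinear`, the helper `matmul`, and the chosen rank and alpha values are all hypothetical; the abstract does not state which SAM layers are adapted or with what rank.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Frozen weight W (out x in) plus a trainable low-rank update B @ A.

    Illustrative only: names, rank, and alpha are assumptions, not taken
    from the SpinalSAM-R1 paper.
    """

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight          # pretrained weight, kept frozen
        self.scale = alpha / rank     # standard LoRA scaling factor
        # A is small and random; B is zero, so the adapter starts as a no-op.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]

    def effective_weight(self):
        """Return W + scale * (B @ A) without modifying the frozen W."""
        delta = matmul(self.B, self.A)
        return [
            [w + self.scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(self.weight, delta)
        ]

    def forward(self, batch):
        """Apply the adapted layer to a batch of input vectors (batch x in)."""
        w = self.effective_weight()
        return [[sum(wi * xi for wi, xi in zip(row, vec)) for row in w] for vec in batch]

    def trainable_params(self):
        """Count only the adapter parameters in A and B."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])
```

For a 64 x 64 weight with rank 4, this adapter trains 4·64 + 64·4 = 512 parameters instead of 4096, which is the sense in which LoRA is parameter-efficient.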

Paper Structure

This paper contains 24 sections, 16 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Overview of the SpinalSAM-R1 system, divided into two functional blocks. Top Block (System Architecture Overview): illustrates the pipeline of instruction parsing and result evaluation. Natural language commands (e.g., “Open Image”, “Add [Lumbar] points”) are parsed by DeepSeek-R1 into four operation categories (Image Operations, Point Operations, etc.), and the system then outputs segmentation results with metrics such as Dice Score and Reasoning Time. Bottom Block (Natural Language Interaction Examples): demonstrates how explicit natural language prompts (e.g., “Generate Spine”, “Add [Lumbar] points”) are encoded by the Prompt Encoder, fused with image embeddings from the Image Encoder, and decoded by the Mask Decoder to generate spinal segmentation results.
  • Figure 2: System architecture of SpinalSAM-R1, comprising five hierarchical layers: User Interface Layer (PyQt5-based, supports point/box/text interaction with real-time visualization), Business Logic Layer (integrates SAM model inference and DeepSeek-R1 natural language parsing), Data Service Layer (manages image loading, caching, and annotation storage), Support Module (handles annotation visualization, coordinate transformation, and model adaptation), and Infrastructure Layer (governs hardware resource allocation, model deployment, and cross-platform compatibility).
  • Figure 3: Overview of the feature-enhanced SAM and its integration with multimodal user interaction.
  • Figure 4: The SpinalSAM-R1 multimodal system supports interactive segmentation via manual annotation or natural language commands.
  • Figure 5: Segmentation results on lumbar CT images in sagittal, coronal, and axial views across different methods. From left to right, the columns show the original image, ground truth, SpinalSAM-R1, UNet, TransUNet, Swin-Unet, SAM-Med2D-box, and SAM-Med2D-point. All methods are evaluated under identical interaction prompts, with blue masks representing the predicted vertebral regions.
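
The Dice and IoU scores used to compare the methods above can be computed from a predicted mask and a ground-truth mask. The sketch below uses the standard definitions, Dice = 2|P ∩ T| / (|P| + |T|) and IoU = |P ∩ T| / |P ∪ T|, on binary masks stored as nested lists. The function name `dice_and_iou` and the convention of returning 1.0 when both masks are empty are assumptions made for illustration, not details taken from the paper.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two binary masks given as nested lists of 0/1.

    Hypothetical helper for illustration; the paper's evaluation code may differ.
    """
    # Flatten both masks and normalise every pixel to 0 or 1.
    p = [int(bool(v)) for row in pred for v in row]
    t = [int(bool(v)) for row in target for v in row]
    if len(p) != len(t):
        raise ValueError("masks must have the same number of pixels")

    intersection = sum(a & b for a, b in zip(p, t))
    p_sum, t_sum = sum(p), sum(t)
    union = p_sum + t_sum - intersection

    if union == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0, 1.0

    dice = 2.0 * intersection / (p_sum + t_sum)
    iou = intersection / union
    return dice, iou


if __name__ == "__main__":
    predicted = [[1, 1, 0],
                 [0, 1, 0]]
    reference = [[1, 0, 0],
                 [0, 1, 1]]
    # Two overlapping pixels, three foreground pixels in each mask.
    print(dice_and_iou(predicted, reference))
```

In this toy case the masks share 2 of 4 foreground pixels in their union, so IoU is 0.5 and Dice is 4/6; Dice is never lower than IoU for the same pair of masks.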