SpineBench: A Clinically Salient, Level-Aware Benchmark Powered by the SpineMed-450k Corpus

Ming Zhao, Wenhui Dong, Yang Zhang, Xiang Zheng, Zhonghao Zhang, Zian Zhou, Yunzhi Guan, Liukun Xu, Wei Peng, Zhaoyang Gong, Zhicheng Zhang, Dachuan Li, Xiaosheng Ma, Yuli Ma, Jianing Ni, Changjiang Jiang, Lixia Tian, Qixin Chen, Kaishun Xia, Pingping Liu, Tongshun Zhang, Zhiqiang Liu, Zhongyan Bi, Chenyang Si, Tiansheng Sun, Caifeng Shan

TL;DR

SpineMed-450k is a provenance-rich, vertebral-level multimodal instruction dataset co-designed with spine clinicians to support level-aware reasoning across X-ray, CT, and MRI. SpineBench is a clinically grounded evaluation framework that assesses LVLMs on imaging reports, diagnosis, and surgical planning, and includes a real-hospital test subset. SpineGPT, a model fine-tuned on SpineMed-450k, shows strong, consistent gains across SpineBench tasks, while the same evaluation exposes weaknesses in open-source LVLMs' spine reasoning. The work demonstrates how targeted, traceable spine data can enable AI to function as a practical clinical collaborator, with potential to improve diagnostic clarity and planning utility in spine care.

Abstract

Spine disorders affect 619 million people globally and are a leading cause of disability, yet AI-assisted diagnosis remains limited by the lack of level-aware, multimodal datasets. Clinical decision-making for spine disorders requires sophisticated reasoning across X-ray, CT, and MRI at specific vertebral levels. However, progress has been constrained by the absence of traceable, clinically grounded instruction data and standardized, spine-specific benchmarks. To address this, we introduce SpineMed, an ecosystem co-designed with practicing spine surgeons. It features SpineMed-450k, the first large-scale dataset explicitly designed for vertebral-level reasoning across imaging modalities, with over 450,000 instruction instances, and SpineBench, a clinically grounded evaluation framework. SpineMed-450k is curated from diverse sources, including textbooks, guidelines, open datasets, and ~1,000 de-identified hospital cases, using a clinician-in-the-loop pipeline with a two-stage LLM generation method (draft and revision) to ensure high-quality, traceable data for question-answering, multi-turn consultations, and report generation. SpineBench evaluates models on clinically salient axes, including level identification, pathology assessment, and surgical planning. Our comprehensive evaluation of several recent, advanced large vision-language models (LVLMs) on SpineBench reveals systematic weaknesses in fine-grained, level-specific reasoning. In contrast, our model fine-tuned on SpineMed-450k demonstrates consistent and significant improvements across all tasks. Clinician assessments confirm the diagnostic clarity and practical utility of our model's outputs.

Paper Structure

This paper contains 43 sections, 3 equations, 20 figures, 7 tables.

Figures (20)

  • Figure 1: Benchmark performance of SpineGPT
  • Figure 2: Overview of SpineMed-450k. Training data was curated from textbooks, public datasets, clinical records, medical guidelines, and hospitals. The process involved data preprocessing, annotation generation, and a final clinician review. Our dataset comprises four types: multi-choice QA, open-ended QA, multi-round dialogues, and reports.
  • Figure 3: Generation pipeline of SpineMed-450k. The pipeline involves data preprocessing (including de-identification, deduplication, and OCR) followed by expert LLM-driven curation. This process generates 450k items for tasks like QA, medical reports, and consultations across various orthopedic subspecialties.
  • Figure 4: Statistics of SpineMed-450k. (a) Distribution of medical records across various hospitals. (b) The prevalence of various orthopedic and spinal diseases. (c) Distribution of different modalities and languages. (d) Benchmark token length distribution: blue (non-report tokens), pink (report tokens).
  • Figure 5: Consistency evaluation of large models and scores given by medical experts
  • ...and 15 more figures