An Explainable Vision-Language Model Framework with Adaptive PID-Tversky Loss for Lumbar Spinal Stenosis Diagnosis

Md. Sajeebul Islam Sk., Md. Mehedi Hasan Shawon, Md. Golam Rabiul Alam

Abstract

Lumbar Spinal Stenosis (LSS) diagnosis remains a critical clinical challenge: it depends heavily on labor-intensive manual interpretation of multi-view Magnetic Resonance Imaging (MRI), leading to substantial inter-observer variability and diagnostic delays. Existing vision-language models fail to address the extreme class imbalance prevalent in clinical segmentation datasets while simultaneously preserving spatial accuracy, primarily because their global pooling mechanisms discard crucial anatomical hierarchies. We present an end-to-end Explainable Vision-Language Model framework designed to overcome these limitations through two principal contributions. First, a Spatial Patch Cross-Attention module enables precise, text-directed localization of spinal anomalies. Second, a novel Adaptive PID-Tversky Loss function integrates control theory principles to dynamically modify training penalties, specifically targeting difficult, under-segmented minority instances. Combining foundational VLMs with an Automated Radiology Report Generation module, our framework demonstrates strong performance: a diagnostic classification accuracy of 90.69%, a macro-averaged Dice score of 0.9512 for segmentation, and a CIDEr score of 92.80%. Furthermore, the framework provides explainability by converting complex segmentation predictions into radiologist-style clinical reports, establishing a new benchmark for transparent, interpretable AI in clinical medical imaging that preserves essential human supervision while enhancing diagnostic capability.
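The Adaptive PID-Tversky Loss named above couples a Tversky segmentation objective with proportional-integral-derivative (PID) control over its penalty weights. As a rough illustration only, the PyTorch sketch below assumes the controller tracks the observed false-negative rate and raises the Tversky false-negative weight (alpha) when under-segmentation persists; the gains kp, ki, kd, the controlled signal, and the update rule are placeholder assumptions, not the paper's actual formulation, which is defined by its own equations.

    # Illustrative sketch only: the paper's Adaptive PID-Tversky formulation is
    # given by its own equations; the gains and controlled signal here are assumed.
    import torch
    import torch.nn as nn

    class AdaptivePIDTverskyLoss(nn.Module):
        def __init__(self, kp=0.5, ki=0.01, kd=0.1, target_fn_rate=0.0, smooth=1e-6):
            super().__init__()
            self.kp, self.ki, self.kd = kp, ki, kd   # PID gains (assumed values)
            self.target = target_fn_rate             # desired false-negative rate
            self.smooth = smooth
            self.alpha, self.beta = 0.5, 0.5         # Tversky FN / FP weights
            self.integral, self.prev_error = 0.0, 0.0

        def forward(self, probs, targets):
            # probs, targets: (N, C, H, W) soft predictions and one-hot ground truth
            dims = (0, 2, 3)
            tp = (probs * targets).sum(dim=dims)
            fn = ((1 - probs) * targets).sum(dim=dims)
            fp = (probs * (1 - targets)).sum(dim=dims)

            # PID update driven by the observed false-negative (under-segmentation) rate
            error = (fn / (tp + fn + self.smooth)).mean().item() - self.target
            self.integral += error
            derivative = error - self.prev_error
            self.prev_error = error
            self.alpha = float(min(max(0.5 + self.kp * error
                                       + self.ki * self.integral
                                       + self.kd * derivative, 0.0), 1.0))
            self.beta = 1.0 - self.alpha             # keep the two weights complementary

            tversky = (tp + self.smooth) / (tp + self.alpha * fn
                                            + self.beta * fp + self.smooth)
            return (1.0 - tversky).mean()            # macro-averaged Tversky loss

Under the standard Tversky convention assumed here, increasing alpha penalizes false negatives more strongly, which is how such a loss can push the model away from under-segmenting minority classes.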

Paper Structure

This paper contains 18 sections, 12 equations, 6 figures, 6 tables, and 1 algorithm.

Figures (6)

  • Figure 1: Detailed model architecture of the proposed multimodal vision-language framework for Lumbar Spinal Stenosis (LSS) diagnosis.
  • Figure 2: Classification performance comparison across multimodal VLMs. (a–c) Confusion matrices from the clinical test set for BiomedCLIP, LLaVA-Med, and SmolVLM, respectively, displaying predicted versus true severity grades (A: normal, B&C: mild-to-moderate stenosis, D: severe stenosis). (d) Receiver operating characteristic (ROC) curves quantifying model discrimination performance across severity grades.
  • Figure 3: Segmentation-based severity classification performance across multimodal VLMs. (a–c) Confusion matrices mapping pixel-level segmentation outputs to clinical severity grades for BiomedCLIP, LLaVA-Med, and SmolVLM (all trained with the proposed Adaptive PID-Tversky loss). (d) Receiver operating characteristic (ROC) curves quantifying the models' spatial discrimination performance derived from segmentation masks across clinical severity levels.
  • Figure 4: Detailed pixel-level segmentation analysis comparing the BiomedCLIP model's predictions with expert-annotated ground truth across stenosis severity levels (Grade A, Grade B&C, and Grade D). The figure has three rows, labeled (a), (b), and (c), each showing a different patient case and stenosis grade. The first two images in each row show the model input: (1) the original axial MRI scan of the lumbar spine and (2) the ground-truth stenosis region (highlighted in dark maroon, with the exact coverage percentage of the spine area and the total annotated pixels reported). The last two images show the model's output: (3) the BiomedCLIP model's predicted stenosis segmentation mask (in green, with the predicted coverage percentage of the spine area, mean confidence score, and maximum confidence score reported) and (4) an overlay comparison reporting pixel-level performance metrics (Dice coefficient and IoU), where yellow marks correctly identified true positives, red marks missed regions (false negatives), and green marks false positives, with exact pixel counts and percentages provided. A minimal computation sketch for these overlay metrics follows the figure list.
  • Figure 5: Report generation performance comparison across multimodal VLMs. (a–c) Confusion matrices from the clinical test set for BiomedCLIP, SmolVLM, and LLaVA-Med, respectively. (d) ROC curves quantifying model discrimination performance across severity grades derived from the semantic content of the automated reports.
  • ...and 1 more figure
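The per-case Dice coefficient and IoU reported in the Figure 4 overlays are standard overlap metrics between a predicted mask and the ground-truth mask. The snippet below is a minimal sketch of how such values can be computed from binary masks; it is not the paper's evaluation code, and the function name and eps smoothing term are illustrative choices.

    # Minimal sketch (not the paper's exact evaluation code): per-case Dice and IoU
    # from binary masks, matching the pixel-level overlay categories of Figure 4.
    import numpy as np

    def dice_iou(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-6):
        """pred, gt: boolean arrays of the same shape (H, W)."""
        pred, gt = pred.astype(bool), gt.astype(bool)
        tp = np.logical_and(pred, gt).sum()    # correctly identified pixels (yellow overlay)
        fp = np.logical_and(pred, ~gt).sum()   # false positives (green overlay)
        fn = np.logical_and(~pred, gt).sum()   # missed pixels (red overlay)
        dice = 2 * tp / (2 * tp + fp + fn + eps)
        iou = tp / (tp + fp + fn + eps)
        return dice, iou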