Deep Learning Models to Automate the Scoring of Hand Radiographs for Rheumatoid Arthritis

Zhiyan Bo, Laura C. Coates, Bartlomiej W. Papiez

TL;DR

This work presents deep learning (DL) pipelines that predict the van der Heijde modified Sharp (SvdH) radiographic damage score and a rheumatoid arthritis (RA) severity class directly from full-hand X‑rays, without explicit joint localization. By combining transfer learning from a large pediatric single-hand X‑ray dataset with stacking ensembles of ResNet and MobileNetV2 models, the authors achieve substantial gains over baselines: regression models reach a Pearson's correlation coefficient (PCC) of about $0.92$ to $0.93$ with a root mean squared error (RMSE) near $18$, and classification models reach a PCC of about $0.86$ with a mean absolute error (MAE) of about 1.1 and an RMSE of about 1.6. Grad‑CAM visualizations indicate that the models focus on clinically relevant anatomical structures in most cases, suggesting potential clinical applicability. Performance approaches that of experienced radiologists, and the results highlight practical considerations such as data imbalance and the need for ordinal‑loss designs to further improve RA severity quantification. These approaches offer a scalable, non‑invasive means to monitor RA progression in both clinical practice and trials.

Abstract

The van der Heijde modification of the Sharp (SvdH) score is a widely used radiographic scoring method to quantify damage in Rheumatoid Arthritis (RA) in clinical trials. However, its complexity, which requires scoring each individual joint, and the expertise it demands limit its application in clinical practice, especially for measuring disease progression. In this work, we addressed this limitation by developing a bespoke, automated pipeline that is capable of predicting the SvdH score and RA severity from hand radiographs without the need to localise the joints first. Using hand radiographs from RA and suspected RA patients, we first investigated the performance of state-of-the-art architectures in predicting the total SvdH score for hands and wrists and its corresponding severity class. Secondly, we leveraged publicly available data sets to perform transfer learning with different finetuning schemes and ensemble learning, which substantially improved model performance to a level on par with an experienced human reader. The best model for RA scoring achieved a Pearson's correlation coefficient (PCC) of 0.925 and root mean squared error (RMSE) of 18.02, while the best model for RA severity classification achieved an accuracy of 0.358 and PCC of 0.859. Our score prediction model attained accuracy almost comparable to that of experienced radiologists (PCC = 0.97, RMSE = 18.75). Finally, using Grad-CAM, we showed that, in the majority of cases, our models could focus on the anatomical structures in hands and wrists that clinicians deemed relevant to RA progression.
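The abstract reports PCC and RMSE for score prediction, and the TL;DR also cites MAE. As a minimal sketch of how these three metrics are conventionally computed from paired true and predicted scores, the following uses only the Python standard library. The function names and the example values are illustrative assumptions and are not taken from the paper or its code.

```python
import math


def pearson_cc(y_true, y_pred):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    sq_err = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(sq_err / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


if __name__ == "__main__":
    # Hypothetical scores, made up for illustration only.
    true_scores = [0.0, 12.0, 47.0, 90.0]
    pred_scores = [3.0, 10.0, 40.0, 100.0]
    print("PCC :", pearson_cc(true_scores, pred_scores))
    print("RMSE:", rmse(true_scores, pred_scores))
    print("MAE :", mae(true_scores, pred_scores))
```

PCC measures how well predictions track the ordering and linear trend of the true scores, while RMSE and MAE measure the size of the errors in score units, which is why a model can have a high PCC and still a non-trivial RMSE.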

Paper Structure (26 sections, 5 equations, 5 figures, 2 tables)

This paper contains 26 sections, 5 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: (A) An example radiograph from Wang et al. [Wang2022] with an average SvdH score of 47. The bones and joints assessed in the left hand and wrist are highlighted in yellow and orange for reference. Examples of bone erosions and joint space narrowing (JSN) are indicated by blue arrows and boxes. (B) The SvdH class distribution of samples used in this paper.
  • Figure 2: Predicted vs. true SvdH scores for the best tuned regression models with transfer learning and the ensemble learning model that combines them. The mean and SD of the deciles are plotted for reference. The models yielded better performance in predicting early-stage cases. Ensemble learning reduced prediction errors on average compared to independent models.
  • Figure 3: Grad-CAM heatmaps of the ResNet-50:RBs-1 regression model for examples of true negatives (TN), false positives (FP), true positives (TP) and false negatives (FN). The true and predicted scores are provided. In the FN example, the overlooked changes in the wrists are circled.
  • Figure 4: Confusion matrices of (A) the best baseline classifier (ResNet-34), (B) the best pretrained classifier (ResNet-50:finetuned), (C) the best ensemble classifier, and (D) the ensemble regression model. Each cell gives the number and proportion of images from a true class that were assigned to each predicted class. The ensemble classifier yielded the fewest misclassifications and the highest overall accuracy.
  • Figure 5: Grad-CAM heatmaps of the ResNet-50:finetuned classifier for examples of TN, FP, TP and FN. In the TP example, some overlooked changes in the fingers are circled.
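
The confusion matrices in Figure 4 report, for each true class, the count and proportion of images assigned to each predicted class. The sketch below shows one conventional way to build such a matrix, row-normalise it, and derive overall accuracy. The number of severity classes (3 here) and the labels are assumptions chosen for illustration, since the paper's class definitions are not given in this summary.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """counts[i][j] = number of samples with true class i predicted as class j."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        counts[t][p] += 1
    return counts


def row_proportions(counts):
    """Divide each row by its total so every true class sums to 1 (or 0 if empty)."""
    props = []
    for row in counts:
        total = sum(row)
        props.append([c / total if total else 0.0 for c in row])
    return props


def accuracy(counts):
    """Fraction of samples on the diagonal, i.e. correctly classified."""
    correct = sum(counts[i][i] for i in range(len(counts)))
    total = sum(sum(row) for row in counts)
    return correct / total


if __name__ == "__main__":
    # Hypothetical labels for a 3-class problem, made up for illustration only.
    true_cls = [0, 0, 1, 2]
    pred_cls = [0, 1, 1, 2]
    cm = confusion_matrix(true_cls, pred_cls, n_classes=3)
    print(cm)
    print(row_proportions(cm))
    print(accuracy(cm))
```

Row-normalised proportions make per-class performance comparable when classes are imbalanced, which is relevant here given the data imbalance noted in the TL;DR.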