Deep Learning Models to Automate the Scoring of Hand Radiographs for Rheumatoid Arthritis
Zhiyan Bo, Laura C. Coates, Bartlomiej W. Papiez
TL;DR
This work presents deep learning (DL) pipelines that predict the SvdH damage score and an RA severity class directly from full-hand radiographs, without explicit joint localization. By combining transfer learning from a large pediatric single-hand X-ray dataset with stacking ensembles of ResNet and MobileNetV2 models, the authors achieve substantial gains over baselines: regression PCCs of about 0.92–0.93 with RMSE near 18, and classification PCCs of about 0.86 with MAE near 1.1 and RMSE near 1.6. Grad-CAM visualizations indicate that the models focus on clinically relevant anatomical structures in most cases, suggesting potential clinical applicability. Score prediction performance approaches that of experienced radiologists, and the results highlight practical considerations such as data imbalance and the need for further work on ordinal loss designs to improve RA severity quantification. These approaches offer a scalable, non-invasive means of monitoring RA progression in both clinical practice and trials.
Abstract
The van der Heijde modification of the Sharp (SvdH) score is a widely used radiographic scoring method for quantifying damage in Rheumatoid Arthritis (RA) in clinical trials. However, its complexity, which stems from the need to score each joint individually, and the expertise it requires limit its application in clinical practice, especially for measuring disease progression. In this work, we addressed this limitation by developing a bespoke, automated pipeline that predicts the SvdH score and RA severity from hand radiographs without first localising the joints. Using hand radiographs from RA and suspected RA patients, we first investigated how well state-of-the-art architectures predict the total SvdH score for hands and wrists and its corresponding severity class. Second, we leveraged publicly available data sets to perform transfer learning with different finetuning schemes, combined with ensemble learning, which substantially improved model performance and brought it on par with an experienced human reader. The best model for RA scoring achieved a Pearson's correlation coefficient (PCC) of 0.925 and a root mean squared error (RMSE) of 18.02, while the best model for RA severity classification achieved an accuracy of 0.358 and a PCC of 0.859. Our score prediction model attained accuracy close to that of experienced radiologists (PCC = 0.97, RMSE = 18.75). Finally, using Grad-CAM, we showed that in the majority of cases our models focused on the anatomical structures in hands and wrists that clinicians deemed relevant to RA progression.
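For readers unfamiliar with the reported metrics, the following is a minimal sketch of how PCC, RMSE, and MAE can be computed from paired reference and predicted scores. It uses the standard textbook definitions and is not the authors' evaluation code; the function names and the example scores are hypothetical and chosen only for illustration.

```python
import math


def pearson_cc(y_true, y_pred):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def mae(y_true, y_pred):
    """Mean absolute error."""
    n = len(y_true)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n


if __name__ == "__main__":
    # Hypothetical reference and predicted total scores (illustrative only).
    reference = [0, 10, 25, 60, 120]
    predicted = [5, 8, 30, 50, 110]
    print("PCC :", round(pearson_cc(reference, predicted), 3))
    print("RMSE:", round(rmse(reference, predicted), 3))
    print("MAE :", round(mae(reference, predicted), 3))
```

PCC measures how well predictions track the ordering and linear trend of the reference scores, while RMSE and MAE measure absolute deviation in score units, so a model can have a high PCC and still carry a sizeable RMSE when errors concentrate at high damage scores.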
