
Automatic hip osteoarthritis grading with uncertainty estimation from computed tomography using digitally-reconstructed radiographs

Masachika Masuda, Mazen Soufi, Yoshito Otake, Keisuke Uemura, Sotaro Kono, Kazuma Takashima, Hidetoshi Hamada, Yi Gu, Masaki Takao, Seiji Okada, Nobuhiko Sugano, Yoshinobu Sato

TL;DR

This work tackles automated grading of hip osteoarthritis severity by leveraging CT-derived digitally-reconstructed radiographs (DRRs) to represent disease progression through Crowe and Kellgren-Lawrence (KL) grades. It evaluates three architectures (Vision Transformer, VGG, DenseNet) in classification and regression settings, testing both combined (seven-class) and separated (Crowe/KL) labeling schemes, and incorporating Monte-Carlo dropout to estimate model uncertainty. The approach is validated on 394 DRRs from 197 patients (internal) with external testing on 104 DRRs from 52 patients, showing high one-neighbor accuracy (ONCA > 0.90) and moderate exact-class accuracy (ECA around 0.65–0.66) in the internal dataset; external results are lower but informative due to distribution shifts in severe cases. Importantly, model uncertainty correlates with prediction errors, suggesting uncertainty estimates can serve as a surrogate for grading reliability and guide human review in large-scale CT databases; code will be publicly released.
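To make the Monte-Carlo dropout idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy linear classifier in place of the ViT/VGG/DenseNet models, and it assumes predictive entropy of the averaged softmax output as the uncertainty measure; the summary above does not say which measure the paper uses. The names `stochastic_forward`, `mc_dropout_predict`, and `toy_weights` are hypothetical. The default of 50 stochastic passes mirrors the "50 samples" setting mentioned in the figure captions.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def stochastic_forward(features, weights, drop_p, rng):
    """One forward pass of a toy linear classifier with dropout kept active."""
    keep = 1.0 - drop_p
    # Inverted dropout: zero each feature with probability drop_p, rescale the rest.
    dropped = [x / keep if rng.random() < keep else 0.0 for x in features]
    logits = [sum(w * x for w, x in zip(row, dropped)) for row in weights]
    return softmax(logits)


def mc_dropout_predict(features, weights, n_samples=50, drop_p=0.5, seed=0):
    """Average n_samples stochastic passes; return (grade, mean_probs, entropy)."""
    if not 0.0 <= drop_p < 1.0:
        raise ValueError("drop_p must be in [0, 1)")
    rng = random.Random(seed)
    passes = [stochastic_forward(features, weights, drop_p, rng)
              for _ in range(n_samples)]
    n_classes = len(weights)
    mean_probs = [sum(p[c] for p in passes) / n_samples for c in range(n_classes)]
    # Assumed uncertainty measure: entropy of the averaged class probabilities.
    entropy = -sum(p * math.log(p) for p in mean_probs if p > 0.0)
    grade = max(range(n_classes), key=lambda c: mean_probs[c])
    return grade, mean_probs, entropy


# Hypothetical toy model: 7 combined grades, 4 input features.
_init = random.Random(1)
toy_weights = [[_init.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(7)]
toy_features = [0.2, 1.0, -0.5, 0.7]

grade, probs, uncertainty = mc_dropout_predict(toy_features, toy_weights)
print("predicted grade:", grade, "uncertainty (entropy):", round(uncertainty, 3))
```

In a screening workflow of the kind the summary describes, cases whose uncertainty exceeds a chosen threshold could be routed to a human reader; the threshold itself would have to be calibrated on held-out data.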

Abstract

Progression of hip osteoarthritis (hip OA) causes pain and disability and is likely to require surgical treatment, such as hip arthroplasty, at the terminal stage. The severity of hip OA is often classified using the Crowe and Kellgren-Lawrence (KL) classifications. However, as the classification is subjective, we aimed to develop an automated approach to classify the disease severity based on the two grades using digitally-reconstructed radiographs (DRRs) from CT images. Automatic grading of the hip OA severity was performed using deep learning-based models. The models were trained to predict the disease grade using two grading schemes, i.e., predicting the Crowe and KL grades separately, and predicting a new ordinal label combining both grades and representing the disease progression of hip OA. The models were trained in classification and regression settings. In addition, the model uncertainty was estimated and validated as a predictor of classification accuracy. The models were trained and validated on a database of 197 hip OA patients, and externally validated on 52 patients. The model accuracy was evaluated using exact class accuracy (ECA), one-neighbor class accuracy (ONCA), and balanced accuracy. The deep learning models produced a comparable accuracy of approximately 0.65 (ECA) and 0.95 (ONCA) in the classification and regression settings. The model uncertainty was significantly larger in cases with large classification errors (P<6e-3). In this study, an automatic approach for grading hip OA severity from CT images was developed. The models showed comparable performance with high ONCA, which facilitates automated grading in large-scale CT databases and indicates the potential for further disease progression analysis. Classification accuracy was correlated with the model uncertainty, which would allow for the prediction of classification errors.
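The three evaluation metrics named in the abstract can be sketched as follows. This is an illustrative reading under stated assumptions, not the authors' evaluation code: ONCA is taken to count a prediction as correct when it is at most one ordinal grade away from the true grade (inferred from the metric's name), balanced accuracy is taken to be the mean per-class recall, and regression outputs are assumed to be rounded to the nearest grade before scoring. All function names and the example labels are hypothetical.

```python
import math
from collections import defaultdict


def exact_class_accuracy(y_true, y_pred):
    """ECA: fraction of cases whose predicted grade equals the true grade."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def one_neighbor_class_accuracy(y_true, y_pred):
    """ONCA (assumed definition): prediction within one grade of the truth."""
    return sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall, so rare grades weigh as much as common ones."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def regression_to_grade(value, n_classes=7):
    """Assumed post-processing: round half up, then clip to a valid grade."""
    return min(max(math.floor(value + 0.5), 0), n_classes - 1)


# Hypothetical labels on the seven-class combined scale (grades 0..6).
y_true = [0, 1, 2, 3, 4, 5, 6, 3]
y_pred = [0, 2, 2, 5, 4, 4, 6, 3]

print(exact_class_accuracy(y_true, y_pred))         # 5 of 8 exact matches
print(one_neighbor_class_accuracy(y_true, y_pred))  # 7 of 8 within one grade
print(balanced_accuracy(y_true, y_pred))
```

The gap between ECA and ONCA in this toy example has the same shape as the reported results: most errors fall on an adjacent grade, so tolerating a one-grade difference raises the score substantially.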

Paper Structure (21 sections, 4 equations, 11 figures, 4 tables)


Figures (11)

  • Figure 1: Disease grading used in the paper. DRR images representing the variations accompanying hip OA disease progression are depicted. The progression grades were constructed as combinations of Crowe and Kellgren-Lawrence (KL) gradings. Higher severity grades are accompanied by narrower space between the femoral head and acetabulum or sub-dislocation or dislocation of the femoral head from the acetabulum. The rationale for this definition of disease classes is explained in the paper's section on the datasets.
  • Figure 2: Overview of the proposed method. Hip OA severity grade was automatically predicted based on the DRR image of the hip joint region automatically extracted from the CT image.
  • Figure 3: P-values for the differences in ECA among the three models and the prediction methods, under the combined and separated label schemes and the 1-sample and 50-sample settings. ✓ indicates a statistically significant higher accuracy: in (a, b), the vertical setting was more accurate than the horizontal setting; in (c, d), the 1-sample setting was more accurate than the 50-sample setting (Student's t-test with Bonferroni correction, P<3e-3 for (a, b), P<1e-3 for (c, d)). ✗ indicates a statistically significant lower accuracy: in (a, b), the vertical setting was less accurate; in (c, d), the 1-sample setting was less accurate. n.s. indicates that no significant difference was observed. Reg: regression, Cls: classification, S50: 50 samples (w/ dropout), Com: combined, Sep: separated.
  • Figure 4: Confusion matrices of the ViT grading model in classification (left) and regression (right) settings. The confusion matrices correspond to the models yielding median ECA in both settings.
  • Figure 5: Distributions of the regression errors in each model under the combined setting. The table shows mean regression errors with inter-quartile ranges (IQR) (Mann-Whitney U test with Bonferroni correction, P<0.02).
  • ...and 6 more figures