Stacked Ensemble of Fine-Tuned CNNs for Knee Osteoarthritis Severity Grading
Adarsh Gupta, Japleen Kaur, Tanvi Doshi, Teena Sharma, Nishchal K. Verma, Shantaram Vasikarla
TL;DR
This work addresses automatic grading of knee osteoarthritis (KOA) severity on the Kellgren-Lawrence (KL) scale by building a stacked ensemble of fine-tuned CNN backbones (MobileNetV2, YOLOv8, DenseNet201) with CatBoost as the meta-learner. The approach uses a class-weighted loss to counter dataset imbalance and follows a three-stage pipeline of preprocessing, fine-tuning, and stacking, validated on the OAI knee X-ray dataset. Experimental results show strong binary detection performance and competitive multiclass grading, with the ensemble outperforming the individual CNNs and several prior methods. The authors compare meta-learners and suggest future enhancements via bagging, feature-based meta-learning, and transformer-based architectures. Overall, the method makes KOA detection and grading faster and more reliable, which can aid clinical decision-making.
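As an illustration of the class-weighted loss mentioned above, the following is a minimal sketch only. The summary does not state which weighting scheme is used, so the code assumes the common inverse-frequency ("balanced") weights, w_c = N / (K * n_c), and normalises the loss by the sum of the sample weights, which is one common convention. The function names and the toy label counts are hypothetical.

```python
import math
from collections import Counter

def balanced_class_weights(labels, num_classes):
    """Inverse-frequency weights w_c = N / (K * n_c).

    Assumed scheme for illustration; not taken from the paper.
    Classes with no samples get weight 0.0.
    """
    counts = Counter(labels)
    total = len(labels)
    return [total / (num_classes * counts[c]) if counts[c] else 0.0
            for c in range(num_classes)]

def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class), divided by the sum of sample weights."""
    num = sum(-weights[y] * math.log(p[y]) for p, y in zip(probs, labels))
    den = sum(weights[y] for y in labels)
    return num / den

# Made-up KL-grade labels (0..4) with a deliberately skewed distribution.
labels = [0] * 8 + [1] * 4 + [2] * 4 + [3] * 3 + [4] * 1
weights = balanced_class_weights(labels, num_classes=5)

# A classifier that predicts a uniform distribution for every sample.
uniform_probs = [[0.2] * 5 for _ in labels]
loss = weighted_cross_entropy(uniform_probs, labels, weights)
```

With these weights, an error on the rarest grade contributes eight times as much to the loss as an error on the most frequent one, which is the effect a class-weighted loss is meant to have on an imbalanced dataset.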
Abstract
Knee Osteoarthritis (KOA) is a musculoskeletal condition that can cause significant limitations and impairments in daily activities, especially among older individuals. KOA severity is typically evaluated by analyzing X-ray images of the affected knee and assigning a grade based on the Kellgren-Lawrence (KL) grading system, which classifies severity into five levels, from 0 to 4. This approach requires considerable expertise and time and is susceptible to subjective interpretation, which can introduce diagnostic inaccuracies. To address this problem, a stacked ensemble model of fine-tuned Convolutional Neural Networks (CNNs) was developed for two classification tasks: a binary classifier for detecting the presence of KOA, and a multiclass classifier for grading across the full KL scale. The proposed stacked ensemble combines a diverse set of pre-trained architectures, namely MobileNetV2, You Only Look Once (YOLOv8), and DenseNet201, as base learners, with Categorical Boosting (CatBoost) as the meta-learner. The proposed model achieved a balanced test accuracy of 73% in multiclass classification and 87.5% in binary classification, which is higher than previously reported results in the literature.
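The stacking step described above can be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes that each base learner outputs a five-way class-probability vector per image and that the meta-learner is trained on the concatenation of those vectors. The abstract does not specify the exact meta-features. The helper name `build_meta_features` and all probability values are hypothetical.

```python
def build_meta_features(*base_probs):
    """Concatenate per-sample class probabilities from each base learner.

    Each argument is a list of probability vectors, one per image.
    Returns one row per image with (num_learners * num_classes) columns.
    """
    n = len(base_probs[0])
    if any(len(p) != n for p in base_probs):
        raise ValueError("all base learners must score the same samples")
    return [[v for probs in sample for v in probs]
            for sample in zip(*base_probs)]

# Made-up 5-class (KL 0..4) probabilities for two X-rays from three base learners.
mobilenet_probs = [[0.6, 0.2, 0.1, 0.05, 0.05], [0.1, 0.1, 0.2, 0.5, 0.1]]
yolo_probs = [[0.5, 0.3, 0.1, 0.05, 0.05], [0.05, 0.15, 0.3, 0.4, 0.1]]
densenet_probs = [[0.7, 0.1, 0.1, 0.05, 0.05], [0.1, 0.1, 0.1, 0.6, 0.1]]

# One row per image, 3 learners * 5 classes = 15 columns.
meta_X = build_meta_features(mobilenet_probs, yolo_probs, densenet_probs)
```

With the `catboost` package installed, a meta-learner could then be fit on such rows with `CatBoostClassifier().fit(meta_X, y)`, where `y` holds the KL grades. To avoid leakage, the rows used to train the meta-learner should come from images the base learners were not trained on.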
