Enhancing Osteoporosis Detection: An Explainable Multi-Modal Learning Framework with Feature Fusion and Variable Clustering
Mehdi Hosseini Chagahi, Saeed Mohammadi Dashtaki, Niloufar Delfan, Nadia Mohammadi, Farshid Rostami Pouria, Behzad Moshiri, Md. Jalil Piran, Oliver Faust
TL;DR
Problem: DXA remains the gold standard for osteoporosis diagnosis, but it is costly and unavailable in many settings, which limits mass screening. This motivates accurate, interpretable AI that works from routine X-ray and clinical data.
Approach: The paper presents an explainable multi-modal framework that uses three pre-trained CNNs (VGG19, InceptionV3, ResNet50) to extract X-ray features and applies PCA for dimensionality reduction. A clustering-based Component Selection and Synergistic Combination (CSSC) layer then forms a compact, representative feature set, which is fused with clinical screening data and fed to a fully connected network (FCN) classifier.
Contributions: The CSSC-based feature selection reduces redundancy while preserving diverse patterns. SHAP analysis identifies Medical History, BMI, and Height as the key drivers, indicating that clinical data outweigh image-derived features.
Findings: On test data, the model reports Generalized R-Square = 0.9729, Entropy R-Square = 0.9307, RASE = 0.0761, MAD = 0.0575, and Log-Likelihood = -2.85, and it provides transparent predictions via SHAP.
Significance: This work advances trustworthy AI for osteoporosis by linking predictions to medically meaningful inputs and enabling clinical integration with lower data requirements.
Abstract
Osteoporosis is a common condition that increases fracture risk, especially in older adults. Early diagnosis is vital for preventing fractures, reducing treatment costs, and preserving mobility. However, healthcare providers face challenges such as limited labeled data and difficulties in processing medical images. This study presents a novel multi-modal learning framework that integrates clinical and imaging data to improve diagnostic accuracy and model interpretability. The model uses three pre-trained networks (VGG19, InceptionV3, and ResNet50) to extract deep features from X-ray images. These features are transformed with PCA to reduce dimensionality and retain the most relevant components. A clustering-based selection process identifies the most representative components, which are then combined with preprocessed clinical data and passed through a fully connected network (FCN) for final classification. A feature importance plot shows that Medical History, BMI, and Height were the main contributors, emphasizing the significance of patient-specific data. Imaging features were valuable but ranked lower in importance, indicating that clinical data are crucial for accurate predictions. This framework promotes precise and interpretable predictions, enhancing transparency and building trust in AI-driven diagnoses for clinical integration.
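The feature pipeline described above (per-network PCA, clustering-based selection of representative components, fusion with clinical data) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the deep features are replaced by synthetic random arrays, the selection step is a simple greedy correlation-based variable clustering standing in for the paper's selection layer (whose exact rule is not given here), and the names `pca_scores`, `select_representatives`, and the `threshold` value are hypothetical. The final FCN classifier is omitted.

```python
import numpy as np


def pca_scores(features, n_components):
    """Project the rows of `features` onto their top principal components."""
    centered = features - features.mean(axis=0)
    # SVD of the centered matrix: the rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def select_representatives(components, threshold=0.5):
    """Greedy variable clustering (an assumed stand-in for the selection layer).

    Columns whose absolute correlation with a seed column reaches `threshold`
    form one cluster; one representative column index is kept per cluster.
    """
    corr = np.abs(np.corrcoef(components, rowvar=False))
    unassigned = list(range(components.shape[1]))
    selected = []
    while unassigned:
        seed = unassigned[0]
        cluster = [j for j in unassigned if corr[seed, j] >= threshold]
        # Representative: the member most correlated, on average, with its cluster.
        within = corr[np.ix_(cluster, cluster)]
        selected.append(cluster[int(np.argmax(within.mean(axis=1)))])
        unassigned = [j for j in unassigned if j not in cluster]
    return selected


rng = np.random.default_rng(0)
n_patients = 40

# Synthetic stand-ins for deep features from the three pre-trained networks.
# A shared latent signal makes components correlate across networks.
latent = rng.normal(size=(n_patients, 8))
backbone_features = {
    name: latent @ rng.normal(size=(8, 64)) + 0.1 * rng.normal(size=(n_patients, 64))
    for name in ("vgg19", "inceptionv3", "resnet50")
}
# Synthetic stand-in for preprocessed clinical variables.
clinical = rng.normal(size=(n_patients, 5))

# Reduce each network's features with PCA, then pool all components.
components = np.hstack([pca_scores(f, 5) for f in backbone_features.values()])

# Keep one representative per cluster and fuse with the clinical data.
keep = select_representatives(components)
fused = np.hstack([components[:, keep], clinical])
```

The `fused` matrix would then serve as input to a classifier. Because components from different networks often encode overlapping structure, clustering them before fusion keeps the image-derived block compact relative to the clinical variables.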
