XAttn-BMD: Multimodal Deep Learning with Cross-Attention for Femoral Neck Bone Mineral Density Estimation
Yilin Zhang, Leo D. Westbury, Elaine M. Dennison, Nicholas C. Harvey, Nicholas R. Fuggle, Rahman Attar
TL;DR
XAttn-BMD addresses the challenge of estimating femoral neck BMD from widely available hip X-rays by fusing image data with structured clinical metadata through a bidirectional cross-attention mechanism. A lightweight ResNet34 image encoder and an MLP metadata encoder are fused via a three-layer, four-head cross-attention module, with a layer-wise fusion weight and a key-value updater to refine representations; a Weighted Smooth L1 loss emphasizes clinically important low-BMD cases. In experiments on the Hertfordshire Cohort data, this approach outperformed concatenation baselines across regression (MSE, MAE, $R^2$) and binary screening tasks (ROC-AUC, precision, recall, F1), with robust cross-validation and visualization of field-level metadata attention. The results suggest meaningful clinical factor–level interpretability and potential utility for osteoporosis screening in settings with limited access to DXA, though external validation on larger multicenter datasets is needed.
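To make the fusion step concrete, the following is a minimal NumPy sketch of bidirectional multi-head cross-attention between image tokens and metadata tokens, stacked over three layers with four heads as described above. Everything else is an assumption made for illustration: the token counts (49 image tokens, 6 metadata fields), the embedding width of 64, the random projection matrices standing in for learned weights, the residual blend used for the layer-wise fusion weight `alpha`, and the choice to pass each layer's updated tokens on as the next layer's keys and values. It is not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def init_params(rng, d):
    # Random projections standing in for learned weights (illustrative only).
    return {k: rng.normal(scale=d ** -0.5, size=(d, d)) for k in ("Wq", "Wk", "Wv", "Wo")}

def cross_attention(queries, context, p, num_heads=4):
    """Queries (n_q, d) attend to context (n_kv, d).

    Returns the attended output (n_q, d) and the attention weights
    with shape (num_heads, n_q, n_kv).
    """
    n_q, d = queries.shape
    n_kv = context.shape[0]
    d_head = d // num_heads

    def split_heads(x, n):
        # (n, d) -> (num_heads, n, d_head)
        return x.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(queries @ p["Wq"], n_q)
    k = split_heads(context @ p["Wk"], n_kv)
    v = split_heads(context @ p["Wv"], n_kv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    # Concatenate heads back to (n_q, d) before the output projection.
    heads = (weights @ v).transpose(1, 0, 2).reshape(n_q, d)
    return heads @ p["Wo"], weights

def bidirectional_layer(img, meta, p_img2meta, p_meta2img, alpha):
    # Both directions read the tokens as they were before this layer's update.
    img_upd, img_to_meta_attn = cross_attention(img, meta, p_img2meta)
    meta_upd, _ = cross_attention(meta, img, p_meta2img)
    # Assumed form of the fusion weight: a residual blend per layer.
    new_img = (1 - alpha) * img + alpha * img_upd
    new_meta = (1 - alpha) * meta + alpha * meta_upd
    return new_img, new_meta, img_to_meta_attn

rng = np.random.default_rng(0)
d_model, num_layers = 64, 3
img = rng.normal(size=(49, d_model))   # hypothetical flattened image feature map
meta = rng.normal(size=(6, d_model))   # hypothetical: one token per metadata field
layers = [(init_params(rng, d_model), init_params(rng, d_model)) for _ in range(num_layers)]
alphas = [0.5, 0.5, 0.5]               # placeholder fusion weights, one per layer

for (p_i2m, p_m2i), alpha in zip(layers, alphas):
    img, meta, img_to_meta_attn = bidirectional_layer(img, meta, p_i2m, p_m2i, alpha)

# Average attention over heads and image tokens: one score per metadata field.
field_scores = img_to_meta_attn.mean(axis=(0, 1))
```

Averaging the image-to-metadata attention weights in this way yields one non-negative score per metadata field, which is one plausible route to the field-level attention visualization mentioned above.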
Abstract
Poor bone health is a significant public health concern, and low bone mineral density (BMD) leads to an increased fracture risk, a key feature of osteoporosis. We present XAttn-BMD (Cross-Attention BMD), a multimodal deep learning framework that predicts femoral neck BMD from hip X-ray images and structured clinical metadata. It uses a novel bidirectional cross-attention mechanism to dynamically integrate image and metadata features so that each modality reinforces the other. A Weighted Smooth L1 loss is tailored to address BMD imbalance and prioritize clinically significant cases. Extensive experiments on data from the Hertfordshire Cohort Study show that our model outperforms the baseline models in regression generalization and robustness. Ablation studies confirm the effectiveness of both cross-attention fusion and the customized loss function. Integrating multimodal data via cross-attention outperforms naive feature concatenation, reducing MSE by 16.7% and MAE by 6.03% and increasing the $R^2$ score by 16.4%, which highlights the effectiveness of the approach for femoral neck BMD estimation. Furthermore, screening performance was evaluated using binary classification at clinically relevant femoral neck BMD thresholds, demonstrating the model's potential in real-world scenarios.
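As an illustration of how a weighted Smooth L1 loss can prioritize low-BMD cases, the sketch below scales the standard Smooth L1 (Huber-style) penalty by a larger weight whenever the true BMD falls below a cutoff. The weighting scheme used in the paper is not specified in this section, so the step-function weighting, the cutoff `low_bmd_threshold=0.6`, the weight `low_weight=2.0`, and the transition point `beta=0.1` are all hypothetical placeholders rather than the authors' settings.

```python
def smooth_l1(error, beta=1.0):
    # Quadratic for small errors, linear for large ones (standard Smooth L1).
    a = abs(error)
    if a < beta:
        return 0.5 * a * a / beta
    return a - 0.5 * beta

def weighted_smooth_l1(preds, targets, low_bmd_threshold=0.6, low_weight=2.0, beta=0.1):
    """Mean Smooth L1 loss with extra weight on samples whose true BMD is low.

    The threshold and weight are illustrative assumptions, not values
    taken from the paper.
    """
    total = 0.0
    for pred, target in zip(preds, targets):
        weight = low_weight if target < low_bmd_threshold else 1.0
        total += weight * smooth_l1(pred - target, beta)
    return total / len(targets)

# Same absolute error (0.05) on a low-BMD target and on a higher-BMD target.
loss_low = weighted_smooth_l1([0.55], [0.50])
loss_high = weighted_smooth_l1([0.95], [0.90])
```

With these placeholder settings, an identical prediction error is penalized twice as heavily when the true value lies below the cutoff, which pushes the regressor to fit the clinically important low-BMD range more closely.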
