Deep Learning-Based Noninvasive Screening of Type 2 Diabetes with Chest X-ray Images and Electronic Health Records
Sanjana Gundapaneni, Zhuo Zhi, Miguel Rodrigues
TL;DR
This study investigates noninvasive Type 2 diabetes screening by fusing chest X-ray (CXR) images with electronic health records (EHR) and electrocardiography (ECG) data. It compares two end-to-end multimodal deep learning architectures, a ViLT-based early-fusion transformer and a ResNet-LSTM joint fusion model, on datasets derived from MIMIC-IV, and shows that end-to-end multimodal integration improves over CXR-only baselines. The ResNet-LSTM joint model yields the strongest performance (AUROC ≈ 0.861–0.862) and benefits from concurrent training across modalities, while ViLT is sensitive to pretraining and data quality. The work highlights the diagnostic value of CXRs within multimodal pipelines for early T2DM identification, releases a preprocessing pipeline for replication, and calls for external validation to assess generalizability across diverse populations and settings.
Abstract
Early detection of type 2 diabetes mellitus (T2DM) is hindered by its asymptomatic onset and its dependence on suboptimal clinical diagnostic tests, which contributes to its widespread global prevalence. While research into noninvasive T2DM screening tools has advanced, conventional machine learning approaches remain limited to unimodal inputs due to extensive feature engineering requirements. In contrast, deep learning models can leverage multimodal data for a more holistic understanding of patients' health conditions. However, the potential of chest X-ray (CXR) imaging, one of the most commonly performed medical procedures, remains underexplored. This study evaluates the integration of CXR images with other noninvasive data sources, including electronic health records (EHRs) and electrocardiography signals, for T2DM detection. Utilising datasets meticulously compiled from the MIMIC-IV databases, we investigated two deep fusion paradigms: an early fusion-based multimodal transformer and a modular joint fusion ResNet-LSTM architecture. The end-to-end trained ResNet-LSTM model achieved an AUROC of 0.86, surpassing the CXR-only baseline by 2.3% with just 9863 training samples. These findings demonstrate the diagnostic value of CXRs within multimodal frameworks for identifying at-risk individuals early. The dataset preprocessing pipeline has also been released to support further research in this domain.
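Since the headline comparison above is stated in terms of AUROC, a minimal sketch of how that metric can be computed may help readers interpret it. The function name `auroc` and the toy labels and scores below are hypothetical illustrations, not data or code from the study; the sketch uses the pairwise (Mann-Whitney) formulation, in which AUROC is the probability that a randomly chosen positive case is scored above a randomly chosen negative one.

```python
def auroc(labels, scores):
    """Pairwise AUROC: fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (NOT from the study): 1 = T2DM, 0 = no T2DM.
labels = [1, 0, 1, 0, 1, 0]
cxr_only_scores = [0.9, 0.8, 0.4, 0.3, 0.7, 0.6]  # assumed single-modality scores
fused_scores = [0.9, 0.5, 0.6, 0.3, 0.7, 0.4]     # assumed multimodal scores

print("CXR-only AUROC:", auroc(labels, cxr_only_scores))
print("Fused AUROC:", auroc(labels, fused_scores))
```

The quadratic pairwise loop is chosen for readability; on datasets of the size reported here, a rank-based implementation from an established library would normally be used instead.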
