LungX: A Hybrid EfficientNet-Vision Transformer Architecture with Multi-Scale Attention for Accurate Pneumonia Detection
Mansur Yerzhanuly
TL;DR
This paper tackles automated pneumonia detection from chest X-rays, where both disease presentation and dataset distribution vary. It introduces LungX, a hybrid CNN–Transformer architecture that fuses EfficientNet-based multi-scale features, refined by CBAM attention, with a pretrained DeiT transformer to capture both local detail and global lung context. On a merged RSNA Pneumonia Challenge and CheXpert dataset, LungX achieves a validation AUC of 0.9446 and an accuracy of 86.58%, surpassing strong CNN baselines. The results include interpretable attention maps (CAMs) that show improved lesion localization, which supports clinical adoption. Future work targets larger multi-center validation and model compression for deployment.
Abstract
Pneumonia remains a leading global cause of mortality, and timely diagnosis is critical. We introduce LungX, a novel hybrid architecture that combines EfficientNet's multi-scale features, CBAM attention mechanisms, and the global context modeling of a Vision Transformer for enhanced pneumonia detection. Evaluated on 20,000 curated chest X-rays from RSNA and CheXpert, LungX achieves state-of-the-art performance (86.5% accuracy, 0.943 AUC), a 6.7% AUC improvement over EfficientNet-B0 baselines. Visual analysis of interpretable attention maps demonstrates superior lesion localization. Future directions include multi-center validation and architectural optimizations targeting 88% accuracy for clinical deployment as an AI diagnostic aid.
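The two headline metrics measure different things: accuracy depends on a decision threshold applied to the model's scores, whereas AUC is threshold-free and reflects how well positive cases are ranked above negative ones. The following is a minimal sketch of both computations, not the evaluation code used for LungX. The function names `accuracy` and `roc_auc`, the 0.5 default threshold, and the toy labels and scores are all illustrative assumptions.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of cases whose thresholded score matches the label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def roc_auc(labels, scores):
    """AUC as the probability that a random positive case outscores
    a random negative case (ties count as one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (1 = pneumonia, 0 = normal); not from the paper.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2]

print(f"accuracy @0.50: {accuracy(labels, scores):.4f}")
print(f"accuracy @0.35: {accuracy(labels, scores, threshold=0.35):.4f}")
print(f"AUC:            {roc_auc(labels, scores):.4f}")
```

On this toy data, moving the threshold from 0.50 to 0.35 changes the accuracy while the AUC stays the same, which is why the two figures are reported side by side.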
