LungX: A Hybrid EfficientNet-Vision Transformer Architecture with Multi-Scale Attention for Accurate Pneumonia Detection

Mansur Yerzhanuly

TL;DR

This paper tackles automated pneumonia detection from chest X-rays, where both the radiographic presentation of the disease and the distribution of the underlying datasets vary. It introduces LungX, a hybrid CNN–Transformer architecture that fuses EfficientNet-based multi-scale features, refined by CBAM, with a pretrained DeiT transformer to capture both local detail and global lung context. On a merged RSNA Pneumonia Challenge and CheXpert dataset, LungX achieves a validation AUC of 0.9446 and an accuracy of 86.58%, surpassing strong CNN baselines. Interpretable attention maps (CAMs) show improved lesion localization, which supports clinical adoption; future work targets larger multi-center validation and model compression for deployment.

Abstract

Pneumonia remains a leading global cause of mortality, and timely diagnosis is critical. We introduce LungX, a novel hybrid architecture combining EfficientNet's multi-scale features, CBAM attention mechanisms, and the Vision Transformer's global context modeling for enhanced pneumonia detection. Evaluated on 20,000 curated chest X-rays from RSNA and CheXpert, LungX achieves state-of-the-art performance (86.5% accuracy, 0.943 AUC), a 6.7% AUC improvement over EfficientNet-B0 baselines. Visual analysis demonstrates superior lesion localization through interpretable attention maps. Future directions include multi-center validation and architectural optimizations targeting 88% accuracy for clinical deployment as an AI diagnostic aid.

Paper Structure

This paper contains 6 sections, 1 equation, 3 figures, 1 table.

Figures (3)

  • Figure 1: Representative chest X-rays showing (a) normal lungs, (b) bacterial pneumonia with localized consolidation, and (c) viral pneumonia with diffuse interstitial opacities. These examples highlight the variability of pneumonia manifestations that makes accurate visual diagnosis challenging (Kermany et al., 2018).
  • Figure 2: Architecture of the proposed LungX (EffViT-AttnNet) model. Multi-scale features extracted from three mid-level stages of the EfficientNet-B3 backbone are refined by CBAM attention modules, fused through a multi-scale feature aggregation block, and encoded by a DeiT-Small Vision Transformer for pneumonia classification.
  • Figure 3: Training dynamics of the proposed LungX (EffViT-AttnNet) model. The plots show the evolution of Accuracy, Loss, AUC, and F1-score over 25 epochs for both training and validation sets, demonstrating smooth convergence and stable generalization performance.
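
The pipeline described in the Figure 2 caption (multi-scale features, CBAM refinement, fusion, transformer encoding) can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation. Several details are assumptions made only for illustration: the channel widths and spatial sizes of the three feature maps, fusion by 1x1 projection plus summation, the 14x14 token grid, and the use of a small generic `nn.TransformerEncoder` as a stand-in for the pretrained DeiT-Small encoder. The class names `MultiScaleFusion` and `HybridHeadSketch` are hypothetical. Random tensors take the place of EfficientNet-B3 outputs so the example runs without downloading weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CBAM(nn.Module):
    """Convolutional Block Attention Module: channel attention, then spatial attention."""

    def __init__(self, channels: int, reduction: int = 16, kernel_size: int = 7):
        super().__init__()
        hidden = max(channels // reduction, 1)
        # Shared MLP applied to both the average- and max-pooled channel descriptors.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1, bias=False),
        )
        # Spatial attention: one conv over the [mean, max] maps taken along channels.
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1))
        mx = self.mlp(F.adaptive_max_pool2d(x, 1))
        x = x * torch.sigmoid(avg + mx)  # channel gate
        pooled = torch.cat(
            [x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1
        )
        return x * torch.sigmoid(self.spatial(pooled))  # spatial gate


class MultiScaleFusion(nn.Module):
    """Hypothetical fusion block: CBAM per scale, 1x1 projection, resize, sum."""

    def __init__(self, in_channels, embed_dim: int = 384, grid: int = 14):
        super().__init__()
        self.grid = grid
        self.cbam = nn.ModuleList(CBAM(c) for c in in_channels)
        self.proj = nn.ModuleList(nn.Conv2d(c, embed_dim, kernel_size=1) for c in in_channels)

    def forward(self, feats):
        fused = 0
        for f, cbam, proj in zip(feats, self.cbam, self.proj):
            f = proj(cbam(f))
            # Bring every scale to the same grid before summing (assumed fusion rule).
            fused = fused + F.interpolate(
                f, size=(self.grid, self.grid), mode="bilinear", align_corners=False
            )
        # (B, C, H, W) -> (B, H*W, C): one token per grid cell.
        return fused.flatten(2).transpose(1, 2)


class HybridHeadSketch(nn.Module):
    """Fusion + a generic transformer encoder standing in for DeiT-Small."""

    def __init__(self, in_channels=(48, 136, 384), embed_dim=384, depth=2, heads=6):
        super().__init__()
        self.fusion = MultiScaleFusion(in_channels, embed_dim)
        layer = nn.TransformerEncoderLayer(
            embed_dim, heads, dim_feedforward=4 * embed_dim, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.classifier = nn.Linear(embed_dim, 1)  # single pneumonia logit

    def forward(self, feats):
        tokens = self.encoder(self.fusion(feats))
        return self.classifier(tokens.mean(dim=1))  # mean-pool tokens, then classify


if __name__ == "__main__":
    # Stand-ins for three backbone feature maps (shapes are illustrative assumptions).
    feats = [
        torch.randn(2, 48, 28, 28),
        torch.randn(2, 136, 14, 14),
        torch.randn(2, 384, 7, 7),
    ]
    logits = HybridHeadSketch()(feats)
    print(logits.shape)  # torch.Size([2, 1])
```

In this sketch the embedding width (384) and head count (6) match DeiT-Small, but the depth is cut to two layers to keep the example light. Replacing the stand-in encoder with pretrained DeiT-Small blocks, and the random tensors with real EfficientNet-B3 stage outputs, would be the natural next step toward the architecture the caption describes.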