Developing a Dual-Stage Vision Transformer Model for Lung Disease Classification

Anirudh Mazumder, Jianguo Liu

TL;DR

The paper addresses automated lung-disease classification from chest X-rays across 14 classes by proposing a dual-stage architecture that fuses a Vision Transformer (ViT) and a Swin Transformer. The ViT serves as a feature extractor pretrained on ImageNet-21k, and the Swin Transformer refines these features with hierarchical, window-based attention, with their outputs concatenated and fed to a classifier. With data normalization, augmentation, and Binary Cross-Entropy with Logits optimization, the model achieves a label-level accuracy of 92.06% on unseen data, though performance may be affected by class imbalance as reflected in the precision–recall behavior. The work highlights the potential of transformer-based dual-stage architectures for rapid, scalable lung-disease diagnosis from X-rays and suggests future directions including more compute, broader dataset benchmarking, and clinical integration to improve real-world impact.
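To make the loss named above concrete, here is a minimal standard-library sketch of binary cross-entropy with logits for one image scored against several independent disease labels. It is not the authors' training code: the function name `bce_with_logits` and the example values are hypothetical, and the paper's actual framework and reduction settings are not stated here.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over independent labels, from raw logits.

    Uses the numerically stable form
        max(z, 0) - z * y + log(1 + exp(-|z|))
    which equals -[y * log(sigmoid(z)) + (1 - y) * log(1 - sigmoid(z))].
    """
    if not logits or len(logits) != len(targets):
        raise ValueError("logits and targets must be non-empty and equal in length")
    total = 0.0
    for z, y in zip(logits, targets):
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


# Hypothetical example: one image, three labels (illustrative values only).
logits = [2.0, -1.0, 0.0]   # raw classifier outputs, before any sigmoid
targets = [1.0, 0.0, 1.0]   # 1.0 = disease present, 0.0 = absent
loss = bce_with_logits(logits, targets)
```

Each label is treated as its own yes/no decision, which is what allows a single X-ray to carry more than one disease label. The stable form avoids overflow when a logit is very large in magnitude.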

Abstract

Lung diseases are prevalent throughout the United States, affecting over 34 million people. Accurate and timely diagnosis of the different types of lung disease is critical, and Artificial Intelligence (AI) methods could speed up this process. In this research, a dual-stage vision transformer is built by integrating a Vision Transformer (ViT) and a Swin Transformer to classify 14 different lung diseases from patients' X-ray scans. After data preprocessing and training, the proposed model achieved a label-level accuracy of 92.06% on an unseen testing subset of the dataset. The model shows promise for accurately classifying lung diseases and diagnosing patients who suffer from them.
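The abstract does not define label-level accuracy. The sketch below assumes it means the fraction of individual (image, label) decisions that are correct, which is an assumption rather than the authors' stated definition. The function name and the toy data are hypothetical.

```python
def label_level_accuracy(y_true, y_pred):
    """Fraction of individual (image, label) decisions that match.

    Assumed definition: every label of every image counts as one decision.
    """
    correct = 0
    total = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            correct += int(t == p)
            total += 1
    if total == 0:
        raise ValueError("no label decisions to score")
    return correct / total


# Hypothetical toy data: 4 images, 3 labels, only 2 positive labels overall.
y_true = [
    [1, 0, 0],
    [0, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]
# A trivial predictor that never flags any disease.
all_negative = [[0, 0, 0] for _ in y_true]

acc = label_level_accuracy(y_true, all_negative)  # 10 of 12 decisions correct
```

On this toy data the all-negative predictor gets 10 of 12 decisions right while detecting no disease at all. Under the assumed definition, this is why a label-level accuracy figure is usually read alongside precision and recall when positive labels are rare.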

Paper Structure (20 sections, 2 figures)

Figures (2)

  • Figure 1: Distribution of the classes throughout the dataset.
  • Figure 2: Precision-recall (PR) curve of the model, used to assess its performance across different classification thresholds.