Interpretable CNN-Multilevel Attention Transformer for Rapid Recognition of Pneumonia from Chest X-Ray Images

Shengchao Chen; Sufen Ren; Guanjun Wang; Mengxing Huang; Chenyang Xue

TL;DR

The paper tackles rapid, interpretable pneumonia recognition from chest X-rays in the COVID-19 era. It proposes the CNN-MMSA-Transformer (CMT), which cascades a CNN-based CXR image feature extractor (CIFE) with a low-complexity Multilevel Multi-Head Self-Attention Transformer (MMSA-Transformer) and is supported by a feature fusion augmentation (FFA) module built on random Beta sampling (RBS). Key contributions are the FFA augmentation strategy, the MMSA mechanism, which has a lower computational cost than conventional multi-head self-attention (MHSA), and ablation studies together with Grad-CAM-based interpretability analysis, validated on a large multi-institution dataset. The reported results show state-of-the-art accuracy at substantially reduced computation, making the method a practical, interpretable tool for clinical pneumonia diagnosis and rapid decision support.

Abstract

Chest imaging plays an essential role in diagnosis and prognosis for COVID-19 patients with evidence of worsening respiratory status. Many deep learning-based approaches for pneumonia recognition have been developed to enable computer-aided diagnosis. However, their long training and inference times make them inflexible, and their lack of interpretability reduces their credibility in clinical practice. This paper aims to develop an interpretable pneumonia recognition framework that captures the complex relationship between lung features and related diseases in chest X-ray (CXR) images and provides high-speed analytics support for medical practice. To reduce computational complexity and accelerate recognition, a novel multilevel self-attention mechanism within the Transformer is proposed, which speeds up convergence and emphasizes task-related feature regions. Moreover, a practical CXR image data augmentation is adopted to address the scarcity of medical image data and boost the model's performance. The effectiveness of the proposed method is demonstrated on the classic COVID-19 recognition task using a widely used pneumonia CXR image dataset. In addition, extensive ablation experiments validate the effectiveness and necessity of every component of the proposed method.
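
For orientation, the conventional multi-head self-attention (MHSA) that the proposed mechanism is compared against can be sketched as follows. This is a minimal NumPy illustration of standard scaled dot-product attention, not the paper's MMSA; the function names, weight matrices, and sizes are assumptions chosen for the example. It makes the source of the cost visible: each head builds an n x n attention map over the n input tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mhsa(x, wq, wk, wv, wo, heads):
    """Standard multi-head self-attention (illustrative, not the paper's MMSA).

    x          : (n, d) token embeddings
    wq, wk, wv : (d, d) projection matrices (hypothetical, randomly set below)
    wo         : (d, d) output projection
    heads      : number of heads; must divide d
    """
    n, d = x.shape
    assert d % heads == 0, "embedding size must be divisible by heads"
    dh = d // heads

    def split(t):
        # (n, d) -> (heads, n, dh)
        return t.reshape(n, heads, dh).transpose(1, 0, 2)

    q, k, v = split(x @ wq), split(x @ wk), split(x @ wv)
    # One (n, n) score matrix per head: this is the quadratic-in-n term.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dh)
    attn = softmax(scores, axis=-1)
    # Merge heads back to (n, d) and apply the output projection.
    out = (attn @ v).transpose(1, 0, 2).reshape(n, d)
    return out @ wo, attn

rng = np.random.default_rng(0)
n_tokens, dim, n_heads = 6, 8, 2  # arbitrary example sizes
tokens = rng.normal(size=(n_tokens, dim))
w_q, w_k, w_v, w_o = (rng.normal(size=(dim, dim)) for _ in range(4))
output, attention = mhsa(tokens, w_q, w_k, w_v, w_o, n_heads)
```

Here `attention` has shape `(heads, n, n)`, so doubling the number of tokens quadruples the size of each attention map.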

Paper Structure (21 sections, 33 equations, 7 figures, 6 tables, 1 algorithm)

Introduction
Related Work
COVID-19 Recognition with CNN
Self-Attention Mechanism
Vision Transformers (ViTs)
Methodology
CXR Image Feature Fusion Augmentation (FFA)
CXR Image Feature Extractor (CIFE)
Multilevel Multi-Head Self-Attention Transformer (MMSA-Transformer)
Multi-Head Self-Attention (MHSA)
Multilevel Multi-Head Self-Attention (MMSA)
Computational Complexity Analysis
Experiments
Dataset
Experimental Setup
...and 6 more sections

Figures (7)

Figure 1: Schematic diagram of the proposed CMT, which mainly consists of a CXR image feature fusion augmentation (FFA) module, a CXR image feature extractor (CIFE), and a Multilevel Multi-Head Self-Attention Transformer (MMSA Transformer). Note that the Shared phase refers to both the training and inference phases.
Figure 2: Schematic diagram of the FFA algorithm. Step 1 presents two viral pneumonia images randomly selected from the CXR image dataset, Step 2 shows the weight matrix generated by RBS when the patch size is 2, and Step 3 shows the two CXR images generated under different position combinations ($m$=1, $n$=0, 1).
Figure 3: Schematic diagram of the CXR image processing flow and the architecture of CIFE. The CXR image features extracted by CIFE are converted into feature embeddings and input to the subsequent Transformer architecture, where "$\times$3", "$\times$4", and "$\times$23" mean that the modules are repeated 3, 4, and 23 times, respectively.
Figure 4: Architecture of the MMSA Transformer, where Add & Norm represents point-wise addition and normalization, MMSA is the proposed Multilevel Multi-Head Self-Attention, and $N$ denotes the number of Encoders and the number of times the 'Feed-Forward-Add & Norm' sequence is repeated.
Figure 5: Architecture of Multi-Head Self-Attention. (Left) Conventional Multi-Head Self-Attention (MHSA); (Right) The proposed Multilevel Multi-Head Self-Attention (MMSA), in which the embedding reshape module receives the feature embedding from CIFE.
...and 2 more figures
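
The FFA steps described in the Figure 2 caption (pick two images of the same class, draw a weight matrix by Beta sampling, blend the images patch by patch) can be sketched roughly as below. This is a hedged illustration only: the function name `patch_beta_mix`, the Beta parameter `alpha`, the reading of "patch size 2" as a 2 x 2 grid of weights, and the choice to return two complementary blends are all assumptions made for the example, and the paper's ($m$, $n$) position scheme is not reproduced.

```python
import numpy as np

def patch_beta_mix(img_a, img_b, patch=2, alpha=1.0, rng=None):
    """Blend two same-class images with one Beta-sampled weight per grid cell.

    Hypothetical helper: a mixup-style sketch, not the paper's exact FFA.
    img_a, img_b : arrays of identical shape (H, W) or (H, W, C)
    patch        : grid size; the image is split into patch x patch cells
    alpha        : Beta(alpha, alpha) shape parameter (assumed value)
    """
    rng = np.random.default_rng() if rng is None else rng
    assert img_a.shape == img_b.shape, "images must have the same shape"
    h, w = img_a.shape[:2]
    assert h % patch == 0 and w % patch == 0, "grid must tile the image"

    # Step 2 analogue: one blending weight in [0, 1] per grid cell.
    weights = rng.beta(alpha, alpha, size=(patch, patch))
    # Expand each cell weight to cover its block of pixels.
    full = np.kron(weights, np.ones((h // patch, w // patch)))
    if img_a.ndim == 3:
        full = full[..., None]  # broadcast over channels

    # Step 3 analogue: two complementary blended images.
    mixed_ab = full * img_a + (1.0 - full) * img_b
    mixed_ba = (1.0 - full) * img_a + full * img_b
    return mixed_ab, mixed_ba, weights

rng = np.random.default_rng(0)
image_a = np.zeros((4, 4))  # stand-ins for two CXR images
image_b = np.ones((4, 4))
mix_1, mix_2, cell_weights = patch_beta_mix(image_a, image_b, patch=2, rng=rng)
```

Because both source images are assumed to share a label, the blended images can keep that label; mixing across classes would need a label-weighting rule that the caption does not describe.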
