Beyond Conventional Transformers: The Medical X-ray Attention (MXA) Block for Improved Multi-Label Diagnosis Using Knowledge Distillation

Amit Rand, Hadi Ibrahim

TL;DR

This paper tackles the challenge of multi-label chest X-ray diagnosis under constrained compute by introducing the Medical X-ray Attention (MXA) block, which fuses dynamic ROI pooling and CBAM-style attention in parallel with MHSA within EfficientViT. It further strengthens learning via a soft, dynamic knowledge-distillation framework from a frozen DenseNet-121 teacher, tailored for 14 CheXpert findings. Empirically, EfficientViT with MXA and KD achieves a mean ROC-AUC of 0.85 on CheXpert, a substantial absolute gain of 0.19 over a strong EfficientViT baseline, while maintaining a practical inference cost. The approach demonstrates that task-specific attention combined with distillation can bridge the gap between general-purpose transformers and radiologist-driven requirements, with potential deployment in point-of-care settings and applicability to other medical imaging modalities.
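The distillation idea summarized above can be illustrated with a minimal sketch. It assumes one common multi-label formulation: a hard binary cross-entropy term against the ground-truth labels plus a soft term against the frozen teacher's temperature-softened sigmoid outputs. The names `distillation_loss`, `alpha`, and `temperature` are illustrative assumptions, and the "dynamic" weighting the paper refers to is not reproduced here.

```python
import math


def sigmoid(z):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def bce(p, target, eps=1e-7):
    """Binary cross-entropy for one predicted probability and one (soft) target."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def distillation_loss(student_logits, teacher_logits, labels,
                      alpha=0.5, temperature=2.0):
    """Blend a hard-label loss with a soft teacher-matching loss.

    alpha and temperature are assumed hyperparameters, not values from the paper.
    Each list holds one entry per finding (14 for CheXpert).
    """
    n = len(labels)
    # Hard term: student probabilities vs. ground-truth labels.
    hard = sum(bce(sigmoid(s), y)
               for s, y in zip(student_logits, labels)) / n
    # Soft term: student vs. teacher, both softened by the temperature.
    soft = sum(bce(sigmoid(s / temperature), sigmoid(t / temperature))
               for s, t in zip(student_logits, teacher_logits)) / n
    return (1.0 - alpha) * hard + alpha * soft


# Toy example with two findings: a student that agrees with the teacher
# incurs a lower loss than one that contradicts it.
agree = distillation_loss([2.0, -2.0], [2.0, -2.0], [1, 0])
disagree = distillation_loss([-2.0, 2.0], [2.0, -2.0], [1, 0])
```

Because the teacher is frozen, only the student logits would receive gradients in a real training loop; this sketch only evaluates the loss value.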

Abstract

Medical imaging, particularly X-ray analysis, often involves detecting multiple conditions simultaneously within a single scan, making multi-label classification crucial for real-world clinical applications. We present the Medical X-ray Attention (MXA) block, a novel attention mechanism tailored specifically to address the unique challenges of X-ray abnormality detection. The MXA block enhances traditional Multi-Head Self-Attention (MHSA) by integrating a specialized module that efficiently captures both detailed local information and broader global context. To the best of our knowledge, this is the first work to propose a task-specific attention mechanism for diagnosing chest X-rays, as well as to attempt multi-label classification using an Efficient Vision Transformer (EfficientViT). By embedding the MXA block within the EfficientViT architecture and employing knowledge distillation, our proposed model significantly improves performance on the CheXpert dataset, a widely used benchmark for multi-label chest X-ray abnormality detection. Our approach achieves an area under the curve (AUC) of 0.85, an absolute improvement of 0.19 over our baseline model's AUC of 0.66, corresponding to an approximately 233% relative improvement over random guessing (AUC = 0.5).
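For readers unfamiliar with how a single AUC figure is obtained in a multi-label setting, the following sketch shows one plausible way to compute it. It assumes the reported value is an unweighted mean of per-finding ROC-AUCs, which the source does not spell out; the function names are illustrative. Each per-label AUC is computed with the pairwise (Mann-Whitney) definition: the fraction of positive/negative pairs in which the positive case receives the higher score.

```python
def roc_auc(scores, labels):
    """ROC-AUC for one finding via pairwise comparison of positives and negatives.

    Returns None when a label has no positives or no negatives,
    since the AUC is undefined in that case.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def mean_roc_auc(score_rows, label_rows):
    """Unweighted mean of per-label AUCs (assumed averaging scheme)."""
    n_labels = len(label_rows[0])
    aucs = []
    for j in range(n_labels):
        auc = roc_auc([row[j] for row in score_rows],
                      [row[j] for row in label_rows])
        if auc is not None:
            aucs.append(auc)
    return sum(aucs) / len(aucs)


# Toy example: four scans, two findings.
scores = [[0.9, 0.2], [0.8, 0.7], [0.3, 0.6], [0.1, 0.4]]
labels = [[1, 0], [1, 1], [0, 0], [0, 1]]
result = mean_roc_auc(scores, labels)  # per-label AUCs are 1.0 and 0.75
```

The quadratic pairwise loop is chosen for clarity; a rank-based implementation would be preferable on a dataset the size of CheXpert.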

Paper Structure

This paper contains 40 sections, 14 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Integration of the Medical X-ray Attention (MXA) Block Architecture. The MXA block injects ROI pooling + CBAM gating in parallel with MHSA, focusing compute on abnormal regions.
  • Figure 2: Training AUC over 30 epochs for baseline and proposed models, showing stable convergence and a consistent performance margin. The improved model converges faster and sustains an AUC 0.19 higher than the naive baseline throughout training, evidencing durable gains.
  • Figure 3: (a) Training AUC over 30 epochs. (b) Final-epoch AUC for each ablation. MXA alone delivers the largest jump in AUC; adding KD yields an additional boost, lifting performance from $0.66$ to $0.85$.
  • Figure 4: After 25 training epochs, an inference pass of the improved model with the MXA block yields more focused and clinically meaningful attention on a CXR with pneumonia. Each heat-map pixel is the normalized attention score for that image patch. In the Naive and Improved panels, bright yellow indicates high attention and dark purple indicates low attention. The Delta panel shows the difference in attention: red indicates regions where MXA attends less than naive MHSA, blue where it attends more, and white means no change. MXA suppresses spurious focus on the shoulders while amplifying attention over the lower left lung field where consolidation is visible, mirroring radiologist practice.
  • Figure F.1: Demonstration of the MXA block. MXA consistently highlights clinically relevant regions across eight patients, confirming its ability to localize subtle abnormalities. Each panel (pt1–pt8) shows the region of interest automatically pooled after an initial inference on chest X‑rays. Red boxes mark MXA‑predicted ROIs. Additional pathology metadata appear in Table \ref{tab:mxa_metadata}.
  • ...and 3 more figures
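
The Delta panel described in the Figure 4 caption can be sketched as follows. This is a minimal illustration that assumes min-max normalization of each patch-level attention map before subtraction; the caption says only that scores are "normalized", so the exact scheme and the function names here are assumptions.

```python
def min_max_normalize(grid):
    """Rescale a 2D grid of attention scores to the range [0, 1].

    Min-max scaling is an assumed normalization, not one confirmed by the source.
    """
    flat = [v for row in grid for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # A constant map carries no spatial preference.
        return [[0.0 for _ in row] for row in grid]
    return [[(v - lo) / span for v in row] for row in grid]


def attention_delta(naive, improved):
    """Per-patch difference: improved (MXA) minus naive (MHSA-only).

    Negative values mark patches where MXA attends less (red in the caption),
    positive values where it attends more (blue), and zero means no change.
    """
    n = min_max_normalize(naive)
    m = min_max_normalize(improved)
    return [[mv - nv for mv, nv in zip(m_row, n_row)]
            for m_row, n_row in zip(m, n)]


# Toy 2x2 patch grids: the naive map peaks top-left, the improved map bottom-right.
naive_map = [[4.0, 2.0], [0.0, 2.0]]
improved_map = [[1.0, 1.0], [3.0, 5.0]]
delta = attention_delta(naive_map, improved_map)
```

Normalizing each map separately makes the two panels comparable in shape rather than in raw magnitude, which suits a qualitative "where does attention move" comparison.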