Beyond Conventional Transformers: The Medical X-ray Attention (MXA) Block for Improved Multi-Label Diagnosis Using Knowledge Distillation
Amit Rand, Hadi Ibrahim
TL;DR
This paper tackles multi-label chest X-ray diagnosis under constrained compute by introducing the Medical X-ray Attention (MXA) block, which runs dynamic ROI pooling and CBAM-style attention in parallel with MHSA inside EfficientViT. Training is further strengthened by a soft, dynamic knowledge-distillation framework that uses a frozen DenseNet-121 teacher and targets the 14 CheXpert findings. Empirically, EfficientViT with MXA and KD reaches a mean ROC-AUC of 0.85 on CheXpert, an absolute gain of 0.19 over the EfficientViT baseline (0.66), while maintaining a practical inference cost. The results suggest that task-specific attention combined with distillation can narrow the gap between general-purpose transformers and the requirements of radiological diagnosis, with potential for point-of-care deployment and for extension to other medical imaging modalities.
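To make the distillation idea concrete, the following is a minimal sketch of a generic soft-target distillation loss for multi-label outputs, written in plain Python. It is not the paper's exact formulation: the summary above does not specify how the "dynamic" weighting works, so the fixed `alpha` and `temperature` values, the function names, and the three-label toy inputs are all assumptions made for illustration (CheXpert itself has 14 findings).

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bce(target, prob, eps=1e-7):
    """Binary cross-entropy between a target in [0, 1] and a probability."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def multilabel_kd_loss(student_logits, teacher_logits, labels,
                       alpha=0.5, temperature=2.0):
    """Blend a hard-label loss with a soft teacher-matching loss.

    alpha and temperature are fixed here (an assumption); a dynamic
    scheme would vary them during training.
    """
    if not (len(student_logits) == len(teacher_logits) == len(labels)):
        raise ValueError("all inputs must have one entry per label")
    n = len(labels)
    # Hard term: student probabilities vs. ground-truth labels.
    hard = sum(bce(y, sigmoid(s))
               for y, s in zip(labels, student_logits)) / n
    # Soft term: student vs. temperature-softened teacher probabilities.
    soft = sum(bce(sigmoid(t / temperature), sigmoid(s / temperature))
               for t, s in zip(teacher_logits, student_logits)) / n
    return alpha * soft + (1.0 - alpha) * hard


# Toy example with three findings instead of fourteen.
labels = [1.0, 0.0, 1.0]
teacher = [2.0, -1.5, 0.5]        # frozen teacher logits (made up)
student_agree = [2.0, -1.5, 0.5]  # student matching the teacher
student_oppose = [-2.0, 1.5, -0.5]

loss_agree = multilabel_kd_loss(student_agree, teacher, labels)
loss_oppose = multilabel_kd_loss(student_oppose, teacher, labels)
print(f"agreeing student: {loss_agree:.4f}")
print(f"opposing student: {loss_oppose:.4f}")
```

Because each finding is an independent binary decision, the sketch uses per-label sigmoids rather than a softmax across classes; a student that agrees with both the teacher and the labels receives the lower loss.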
Abstract
Medical imaging, particularly X-ray analysis, often involves detecting multiple conditions simultaneously within a single scan, making multi-label classification crucial for real-world clinical applications. We present the Medical X-ray Attention (MXA) block, a novel attention mechanism tailored specifically to the challenges of X-ray abnormality detection. The MXA block enhances traditional Multi-Head Self-Attention (MHSA) by integrating a specialized module that efficiently captures both detailed local information and broader global context. To the best of our knowledge, this is the first work to propose a task-specific attention mechanism for diagnosing chest X-rays, as well as the first to attempt multi-label classification using an Efficient Vision Transformer (EfficientViT). By embedding the MXA block within the EfficientViT architecture and employing knowledge distillation, our proposed model significantly improves performance on the CheXpert dataset, a widely used benchmark for multi-label chest X-ray abnormality detection. Our approach achieves an area under the curve (AUC) of 0.85, an absolute improvement of 0.19 over our baseline model's AUC of 0.66. Measured against random guessing (AUC = 0.5), this raises the margin above chance from 0.16 to 0.35, roughly 2.2 times the baseline's margin.
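The arithmetic behind the headline comparison can be checked from the two AUC values quoted above. The short sketch below computes the absolute gain and the margin over chance for each model; the helper name `auc_gains` is hypothetical, and because the inputs are AUCs rounded to two decimals, the resulting ratio is only approximate.

```python
CHANCE_AUC = 0.5  # AUC of a random classifier


def auc_gains(baseline_auc, model_auc, chance_auc=CHANCE_AUC):
    """Compare two AUC scores in absolute terms and relative to chance."""
    baseline_margin = baseline_auc - chance_auc
    model_margin = model_auc - chance_auc
    if baseline_margin <= 0:
        raise ValueError("baseline must be better than chance")
    margin_ratio = model_margin / baseline_margin
    return {
        "absolute_gain": model_auc - baseline_auc,
        "baseline_margin": baseline_margin,
        "model_margin": model_margin,
        "margin_ratio": margin_ratio,
        # Relative increase of the margin over chance, as a fraction.
        "relative_margin_increase": margin_ratio - 1.0,
    }


# Values quoted in the abstract: baseline 0.66, MXA + KD model 0.85.
gains = auc_gains(baseline_auc=0.66, model_auc=0.85)
for name, value in gains.items():
    print(f"{name}: {value:.4f}")
```

Reporting the margin over chance alongside the absolute gain is useful because an AUC of 0.5 carries no diagnostic signal, so improvements are more meaningful when expressed relative to that floor.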
