CGF-DETR: Cross-Gated Fusion DETR for Enhanced Pneumonia Detection in Chest X-rays
Yefeng Wu, Yuchen Song, Ling Wu, Shan Wan, Yecheng Zhao
TL;DR
The paper tackles automated pneumonia detection in chest X-rays by adapting a real-time transformer detector (RT-DETR) with three specialized modules: XFABlock enhances backbone multi-scale feature extraction through convolutional attention within a CSP framework; SPGA provides efficient, gated single-head attention for feature fusion; and GCFC3 provides rich multi-scale representation in the neck, using multi-path fusion at training time and structural re-parameterization at inference time. On the RSNA Pneumonia Detection dataset, CGF-DETR achieves $mAP@0.5=82.2\%$ and $mAP@[0.5:0.95]=50.4\%$, while maintaining real-time performance at approximately 48 FPS and reduced latency relative to the baseline. Ablation studies confirm that each module contributes meaningfully, with notable synergistic gains when combined, highlighting the approach's potential for accurate, fast detection of subtle pneumonia patterns in clinical chest X-ray workflows. The work suggests that domain-specific architectural adaptations in the backbone, encoder, and neck can substantially improve both accuracy and speed in medical image detectors, with implications for applications beyond pneumonia detection.
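The two reported metrics differ only in how strictly a predicted box must overlap the ground truth. The sketch below is an illustration rather than the paper's evaluation code: it assumes the usual COCO-style reading of $mAP@[0.5:0.95]$ (ten IoU thresholds from 0.50 to 0.95 in steps of 0.05) and boxes given as `(x1, y1, x2, y2)`. The helper names `iou` and `matched_thresholds` are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap width/height, clamped at zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# Assumed COCO-style thresholds: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def matched_thresholds(pred_box, gt_box):
    """Thresholds at which this prediction would count as a true positive."""
    score = iou(pred_box, gt_box)
    return [t for t in IOU_THRESHOLDS if score >= t]
```

Under these assumptions, a prediction with IoU 0.82 counts as a hit for $mAP@0.5$ but only at seven of the ten thresholds averaged in $mAP@[0.5:0.95]$, which is why the second metric is consistently the lower of the two.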
Abstract
Pneumonia remains a leading cause of morbidity and mortality worldwide, necessitating accurate and efficient automated detection systems. While recent transformer-based detectors like RT-DETR have shown promise in object detection tasks, their application to medical imaging, particularly pneumonia detection in chest X-rays, remains underexplored. This paper presents CGF-DETR, an enhanced real-time detection transformer specifically designed for pneumonia detection. We introduce XFABlock in the backbone to improve multi-scale feature extraction through convolutional attention mechanisms integrated with a CSP architecture. To achieve efficient feature aggregation, we propose the SPGA module, which replaces standard multi-head attention with dynamic gating mechanisms and single-head self-attention. Additionally, GCFC3 is designed for the neck to enhance feature representation through multi-path convolution fusion while maintaining real-time performance via structural re-parameterization. Extensive experiments on the RSNA Pneumonia Detection dataset demonstrate that CGF-DETR achieves 82.2% mAP@0.5, outperforming the baseline RT-DETR-l by 3.7% while maintaining comparable inference speed at 48.1 FPS. Our ablation studies confirm that each proposed module contributes meaningfully to the overall performance improvement, with the complete model achieving 50.4% mAP@[0.5:0.95].
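Structural re-parameterization relies on the linearity of convolution: parallel linear branches that are summed during training can be folded into one equivalent kernel for inference. The sketch below is a one-dimensional toy in plain Python, not the GCFC3 design itself. The branch layout (a 3-tap branch, a 1-tap branch, and an identity shortcut) and the names `conv1d_same`, `multi_branch`, and `merge_branches` are assumptions chosen for illustration, and normalization layers are omitted.

```python
def conv1d_same(signal, kernel):
    """1-D cross-correlation with zero padding; kernel length must be odd."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]


def multi_branch(signal, k3, k1):
    """Training-time form: 3-tap branch + 1-tap branch + identity, summed."""
    branch3 = conv1d_same(signal, k3)
    branch1 = conv1d_same(signal, k1)
    return [a + b + x for a, b, x in zip(branch3, branch1, signal)]


def merge_branches(k3, k1):
    """Fold the 1-tap and identity branches into the centre tap of k3."""
    merged = list(k3)
    # A 1-tap kernel and the identity both act only on the centre position,
    # so their weights (k1[0] and 1.0) add onto the centre tap.
    merged[1] += k1[0] + 1.0
    return merged


signal = [1.0, 2.0, -1.0, 0.5]
k3, k1 = [0.5, -1.0, 0.25], [2.0]

train_out = multi_branch(signal, k3, k1)
infer_out = conv1d_same(signal, merge_branches(k3, k1))
```

Here `train_out` and `infer_out` agree element by element, while the inference path evaluates a single kernel instead of three branches. This is the general mechanism by which a multi-path block can keep its training-time capacity without adding inference cost.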
