CGF-DETR: Cross-Gated Fusion DETR for Enhanced Pneumonia Detection in Chest X-rays

Yefeng Wu, Yuchen Song, Ling Wu, Shan Wan, Yecheng Zhao

TL;DR

The paper tackles automated pneumonia detection in chest X-rays by adapting a real-time transformer detector (RT-DETR) with three specialized modules. XFABlock enhances multi-scale feature extraction in the backbone through convolutional attention within a CSP framework. SPGA performs efficient feature fusion with gated single-head attention. GCFC3 enriches multi-scale representation in the neck, using multi-path fusion during training and structural re-parameterization at inference. On the RSNA Pneumonia Detection dataset, CGF-DETR achieves $\mathrm{mAP}@0.5 = 82.2\%$ and $\mathrm{mAP}@[0.5:0.95] = 50.4\%$ while running in real time at approximately 48 FPS, with reduced latency relative to the baseline. Ablation studies confirm that each module contributes meaningfully, with notable synergistic gains when combined, which highlights the approach's potential for accurate, fast detection of subtle pneumonia patterns in clinical chest X-ray workflows. The work suggests that domain-specific architectural adaptations in the backbone, encoder, and neck can substantially improve both accuracy and speed in medical image detectors, with implications for applications beyond pneumonia detection.

Abstract

Pneumonia remains a leading cause of morbidity and mortality worldwide, necessitating accurate and efficient automated detection systems. While recent transformer-based detectors like RT-DETR have shown promise in object detection tasks, their application to medical imaging, particularly pneumonia detection in chest X-rays, remains underexplored. This paper presents CGF-DETR, an enhanced real-time detection transformer specifically designed for pneumonia detection. We introduce XFABlock in the backbone to improve multi-scale feature extraction through convolutional attention mechanisms integrated with a CSP architecture. To achieve efficient feature aggregation, we propose the SPGA module, which replaces standard multi-head attention with dynamic gating mechanisms and single-head self-attention. Additionally, GCFC3 is designed for the neck to enhance feature representation through multi-path convolution fusion while maintaining real-time performance via structural re-parameterization. Extensive experiments on the RSNA Pneumonia Detection dataset demonstrate that CGF-DETR achieves 82.2% mAP@0.5, outperforming the baseline RT-DETR-l by 3.7% while maintaining comparable inference speed at 48.1 FPS. Our ablation studies confirm that each proposed module contributes meaningfully to the overall performance improvement, with the complete model achieving 50.4% mAP@[0.5:0.95].
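The structural re-parameterization mentioned for GCFC3 relies on the linearity of convolution: parallel branches whose outputs are summed can be folded into a single kernel after training. The sketch below illustrates only that general idea on a 1D, single-channel signal, with a 3-tap branch, a 1-tap branch, and an identity branch. It is not the GCFC3 module itself; the function names (`conv1d_same`, `merge_branches`, `multi_branch`) and the kernel values are hypothetical, and normalization layers are omitted.

```python
def conv1d_same(x, kernel):
    """Cross-correlate x with an odd-length kernel, zero-padding to keep the length."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [
        sum(kernel[j] * padded[i + j] for j in range(len(kernel)))
        for i in range(len(x))
    ]


def pad_kernel(kernel, size):
    """Zero-pad an odd-length kernel symmetrically so it has `size` taps."""
    extra = (size - len(kernel)) // 2
    return [0.0] * extra + list(kernel) + [0.0] * extra


def multi_branch(x, kernels):
    """Training-time form: run every branch separately and sum the outputs."""
    outputs = [conv1d_same(x, k) for k in kernels]
    return [sum(values) for values in zip(*outputs)]


def merge_branches(kernels):
    """Inference-time form: fold all branches into one equivalent kernel."""
    size = max(len(k) for k in kernels)
    padded = [pad_kernel(k, size) for k in kernels]
    return [sum(taps) for taps in zip(*padded)]


# Hypothetical branch kernels: 3-tap conv, 1-tap conv, identity (a 1-tap kernel of 1.0).
branches = [[0.2, 0.5, -0.1], [0.7], [1.0]]
signal = [1.0, 2.0, -1.0, 0.5, 3.0]

merged_kernel = merge_branches(branches)
train_out = multi_branch(signal, branches)
infer_out = conv1d_same(signal, merged_kernel)
```

Both forms compute the same output up to floating-point rounding, but the merged form needs one convolution instead of three, which is why this kind of fusion can keep inference cost low.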

Paper Structure

This paper contains 17 sections, 7 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Overall architecture of CGF-DETR. The backbone integrates XFABlock, the encoder leverages SPGA, the neck adopts GCFC3, and detection heads follow the RT-DETR design.
  • Figure 2: Architecture of XFABlock. XFABlocks apply convolutional attention with residual and FFN branches inside a CSP-style structure to capture multi-scale context.
  • Figure 3: SPGA module. A narrow branch applies single-head attention with dynamic sparsity while a wide bypass preserves high-frequency information before recombining. The gating network dynamically controls attention sparsity based on input content.
  • Figure 4: GCFC3 module architecture. During training, multiple parallel convolution paths with diverse kernel configurations capture multi-scale features. During inference, these paths are structurally re-parameterized into a simplified form for computational efficiency.
  • Figure 5: Visualization comparison between CGF-DETR and RT-DETR.
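The Figure 3 caption describes a narrow branch with single-head attention, a wide bypass, and a content-dependent gate. The sketch below is one loose reading of that description in plain Python, not the SPGA module as published: attention uses Q = K = V without learned projections, the gate is a per-token sigmoid that blends attended and original narrow features rather than controlling sparsity, and all names (`gated_narrow_attention`, `narrow_dim`, `gate_weights`, `gate_bias`) are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def single_head_attention(tokens):
    """Scaled dot-product self-attention with Q = K = V = tokens (simplification)."""
    scale = 1.0 / math.sqrt(len(tokens[0]))
    out = []
    for q in tokens:
        weights = softmax([dot(q, k) * scale for k in tokens])
        out.append([
            sum(w * v[c] for w, v in zip(weights, tokens))
            for c in range(len(q))
        ])
    return out


def gated_narrow_attention(tokens, narrow_dim, gate_weights, gate_bias=0.0):
    """Attend over the first `narrow_dim` channels; pass the rest through unchanged."""
    narrow = [t[:narrow_dim] for t in tokens]
    bypass = [t[narrow_dim:] for t in tokens]
    attended = single_head_attention(narrow)
    out = []
    for t, n, a, b in zip(tokens, narrow, attended, bypass):
        # Gate in (0, 1), computed from the full token (assumed form of the gating network).
        g = sigmoid(dot(gate_weights, t) + gate_bias)
        mixed = [g * ai + (1.0 - g) * ni for ai, ni in zip(a, n)]
        out.append(mixed + b)  # recombine the narrow branch with the bypass channels
    return out


# Three hypothetical 4-channel tokens; the first 2 channels form the narrow branch.
tokens = [[1.0, 0.0, 2.0, 3.0], [0.0, 1.0, -1.0, 0.5], [0.5, 0.5, 0.0, 1.0]]
fused = gated_narrow_attention(tokens, narrow_dim=2, gate_weights=[0.1, -0.2, 0.3, 0.0])
```

Restricting attention to a channel slice reduces the cost of the attention step, while the bypass channels reach the output untouched, which matches the caption's point about preserving information outside the attention path.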