LW-DETR: A Transformer Replacement to YOLO for Real-Time Detection

Qiang Chen, Xiangbo Su, Xinyu Zhang, Jian Wang, Jiahui Chen, Yunpeng Shen, Chuchu Han, Ziliang Chen, Weixiang Xu, Fanrong Li, Shan Zhang, Kun Yao, Errui Ding, Gang Zhang, Jingdong Wang

TL;DR

LW-DETR presents a lightweight DETR-based real-time detector that rivals and often surpasses state-of-the-art CNN detectors. By using a plain ViT encoder connected through a convolutional projector to a compact deformable DETR decoder, and by applying multi-level feature aggregation, interleaved window/global attention, and a window-major feature map organization, the approach achieves strong accuracy with low latency. The authors demonstrate substantial gains from Objects365 pretraining, an IoU-aware loss, and targeted training strategies, with five model scales offering a wide speed-accuracy spectrum. Extensive experiments on COCO, UVO, and RF100 show that LW-DETR outperforms concurrent detectors across multiple domains, highlighting its practicality for real-time vision tasks. The work provides a simple, efficient baseline for transformer-based real-time detection with broad applicability.

Abstract

In this paper, we present a lightweight detection transformer, LW-DETR, which outperforms YOLOs for real-time object detection. The architecture is a simple stack of a ViT encoder, a projector, and a shallow DETR decoder. Our approach leverages recent advances: training-effective techniques, such as an improved loss and pretraining, and interleaved window and global attention for reducing the ViT encoder complexity. We improve the ViT encoder by aggregating multi-level feature maps, i.e., the intermediate and final feature maps in the ViT encoder, to form richer feature maps, and we introduce a window-major feature map organization to improve the efficiency of interleaved attention computation. Experimental results demonstrate that the proposed approach is superior to existing real-time detectors, e.g., YOLO and its variants, on COCO and other benchmark datasets. Code and models are available at https://github.com/Atten4Vis/LW-DETR.
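The window-major feature map organization mentioned in the abstract can be illustrated with a small index-level sketch. This is not the authors' implementation; the helper names `window_major_order` and `invert` are hypothetical, and the 4x4 map with 2x2 windows is an arbitrary toy size. The assumed idea is that tokens are reordered once so that every window occupies a contiguous run of the sequence: window attention can then slice the sequence directly, while global attention uses the same sequence unchanged.

```python
def window_major_order(height, width, window):
    """Return row-major token indices listed window by window."""
    if height % window or width % window:
        raise ValueError("height and width must be divisible by window")
    order = []
    for wy in range(0, height, window):          # top row of each window
        for wx in range(0, width, window):       # left column of each window
            for y in range(wy, wy + window):     # rows inside one window
                for x in range(wx, wx + window):
                    order.append(y * width + x)
    return order


def invert(order):
    """Inverse permutation: for each row-major index, its window-major position."""
    inverse = [0] * len(order)
    for new_pos, old_pos in enumerate(order):
        inverse[old_pos] = new_pos
    return inverse


# A 4x4 feature map with 2x2 windows; tokens are labelled by row-major index.
order = window_major_order(4, 4, 2)
tokens = [f"t{i}" for i in order]                # reorder once, up front

# Window attention: each window is a contiguous slice of the sequence.
tokens_per_window = 2 * 2
windows = [tokens[i:i + tokens_per_window]
           for i in range(0, len(tokens), tokens_per_window)]

# Global attention: the same sequence is used as-is, with no reordering.
global_sequence = tokens

# Restore the row-major layout once, after the last encoder layer.
restored = [tokens[p] for p in invert(order)]
```

In this sketch the only reorderings happen at the start and at the end, which is the property that makes alternating between window and global attention layers cheap.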

Paper Structure

This paper contains 20 sections, 6 equations, 5 figures, 12 tables.

Figures (5)

  • Figure 1: Our approach outperforms previous SoTA real-time detectors. The x-axis corresponds to the inference time. The y-axis corresponds to the mAP score on COCO val2017. All the models are trained with pretraining on Objects365. The NMS post-processing times are included for other models and are measured on COCO val2017 with the settings from the official implementations [lyu2022rtmdet, yolov8_ultralytics, supergradients] and with the well-tuned NMS post-processing setting (labeled as "*").
  • Figure 2: An example of a transformer encoder with multi-level feature map aggregation and interleaved window and global attentions. The FFN and LayerNorm layers are omitted for clarity.
  • Figure 3: Single-scale projector and multi-scale projector for (a) the tiny, small, and medium models, and (b) the large and xlarge models.
  • Figure 4: Our approach outperforms concurrent works. The x-axis corresponds to the inference time. The y-axis corresponds to the mAP score on COCO val2017. Our LW-DETR, RT-DETR [lv2023detrs], YOLO-MS [chen2023yolo], and Gold-YOLO [wang2023gold] are trained with pretraining on Objects365, while YOLOv10 [wang2024yolov10] is not. The NMS post-processing times are included for YOLO-MS and Gold-YOLO and are measured on COCO val2017 with the settings from the official implementations and with the well-tuned NMS post-processing setting (labeled as "*").
  • Figure 5: Distribution of the number of boxes. The x-axis corresponds to the number of boxes that are fed into NMS. The y-axis corresponds to the number of images in COCO val2017 whose remaining box counts fall in the corresponding interval. (a) uses the default score threshold. (b) uses a score threshold tuned to balance detection performance and latency. (c) uses a higher score threshold.
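
The relationship behind Figure 5, that the score threshold determines how many boxes reach NMS, can be sketched with a plain greedy NMS. This is only an illustration: the boxes, scores, and the thresholds 0.001 and 0.3 are made-up values, not settings from the paper, and `filter_then_nms` is a hypothetical helper rather than any detector's actual post-processing code.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_then_nms(boxes, scores, score_threshold, iou_threshold=0.5):
    """Drop low-score boxes, then run greedy NMS on what remains.

    Returns (number of boxes fed into NMS, indices of the kept boxes).
    """
    candidates = [i for i, s in enumerate(scores) if s >= score_threshold]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept = []
    for i in candidates:
        # Keep a box only if it does not overlap too much with a kept one.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in kept):
            kept.append(i)
    return len(candidates), kept


# Made-up detections, for illustration only.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30),
         (21, 21, 31, 31), (50, 50, 60, 60)]
scores = [0.90, 0.60, 0.80, 0.05, 0.02]

fed_low, kept_low = filter_then_nms(boxes, scores, score_threshold=0.001)
fed_high, kept_high = filter_then_nms(boxes, scores, score_threshold=0.3)
```

With the low threshold all five boxes enter NMS, while the higher threshold passes only three. The higher threshold also discards the low-confidence box at index 4, which is the accuracy-versus-latency trade-off the caption describes.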