LW-DETR: A Transformer Replacement to YOLO for Real-Time Detection
Qiang Chen, Xiangbo Su, Xinyu Zhang, Jian Wang, Jiahui Chen, Yunpeng Shen, Chuchu Han, Ziliang Chen, Weixiang Xu, Fanrong Li, Shan Zhang, Kun Yao, Errui Ding, Gang Zhang, Jingdong Wang
TL;DR
LW-DETR is a lightweight DETR-based real-time detector that rivals and often surpasses state-of-the-art CNN detectors. It connects a plain ViT encoder through a convolutional projector to a compact deformable DETR decoder, and it achieves strong accuracy at low latency by combining multi-level feature aggregation, interleaved window/global attention, and a window-major feature map organization. The authors demonstrate substantial gains from Objects365 pretraining, an IoU-aware loss, and targeted training strategies, and they provide five model scales that cover a wide speed-accuracy spectrum. Extensive experiments on COCO, UVO, and RF100 show that LW-DETR outperforms concurrent detectors across multiple domains, highlighting its practicality for real-time vision tasks. The work provides a simple, efficient baseline for transformer-based real-time detection with broad applicability.
Abstract
In this paper, we present a light-weight detection transformer, LW-DETR, which outperforms YOLOs for real-time object detection. The architecture is a simple stack of a ViT encoder, a projector, and a shallow DETR decoder. Our approach leverages recent advances: training-effective techniques, e.g., an improved loss and pretraining, and interleaved window and global attention to reduce the complexity of the ViT encoder. We improve the ViT encoder by aggregating multi-level feature maps, i.e., the intermediate and final feature maps of the encoder, to form richer feature maps, and we introduce a window-major feature map organization to make the interleaved attention computation more efficient. Experimental results demonstrate that the proposed approach is superior to existing real-time detectors, e.g., YOLO and its variants, on COCO and other benchmark datasets. Code and models are available at https://github.com/Atten4Vis/LW-DETR.
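
To make the window-major organization concrete, the following is a minimal sketch of one plausible reading of the idea, not the authors' implementation. It assumes tokens of an H x W feature map are normally stored row by row, and that window-major storage lists all tokens of the first window, then all tokens of the second window, and so on. Under that assumption each window is a contiguous slice, so window attention needs no reordering, while global attention can operate on the same sequence because attention does not depend on token order. All function names (`window_major_order`, `to_window_major`, `to_row_major`, `split_windows`) are hypothetical.

```python
def window_major_order(height, width, window):
    """Return row-major token indices listed in window-major order.

    Assumption: windows are visited left to right, top to bottom, and
    tokens inside a window are also visited row by row.
    """
    if height % window or width % window:
        raise ValueError("height and width must be divisible by window")
    order = []
    for win_row in range(0, height, window):
        for win_col in range(0, width, window):
            for r in range(win_row, win_row + window):
                for c in range(win_col, win_col + window):
                    order.append(r * width + c)
    return order


def to_window_major(tokens, height, width, window):
    """Reorder a row-major token list into window-major order."""
    return [tokens[i] for i in window_major_order(height, width, window)]


def to_row_major(tokens_wm, height, width, window):
    """Invert to_window_major, restoring the row-major layout."""
    order = window_major_order(height, width, window)
    restored = [None] * len(order)
    for position, index in enumerate(order):
        restored[index] = tokens_wm[position]
    return restored


def split_windows(tokens_wm, window):
    """Cut a window-major sequence into contiguous per-window chunks."""
    size = window * window
    return [tokens_wm[i:i + size] for i in range(0, len(tokens_wm), size)]


# Usage: a 4 x 4 map with 2 x 2 windows, tokens labelled by row-major index.
tokens = list(range(16))
tokens_wm = to_window_major(tokens, 4, 4, 2)
windows = split_windows(tokens_wm, 2)  # input for window attention
restored = to_row_major(tokens_wm, 4, 4, 2)  # layout expected by the decoder
```

In this sketch the reordering happens once before the encoder and once after it, rather than around every window-attention layer, which is the kind of saving the abstract attributes to the window-major organization.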
