RF-DETR: Neural Architecture Search for Real-Time Detection Transformers
Isaac Robinson, Peter Robicheaux, Matvei Popov, Deva Ramanan, Neehar Peri
TL;DR
Open-vocabulary detectors often struggle to generalize to real-world, out-of-distribution classes, while real-time specialist detectors lag in accuracy. RF-DETR combines internet-scale foundation-model priors with end-to-end weight-sharing neural architecture search (NAS) to discover Pareto-optimal accuracy-latency configurations on a given target dataset without retraining sub-nets. Key contributions are a scheduler-free NAS pipeline over a multi-knob search space (patch size, decoder layers, queries, image resolution, and windowed attention), a DINOv2-based backbone with a lightweight segmentation head (RF-DETR-Seg), and a standardized latency benchmarking protocol. RF-DETR demonstrates state-of-the-art real-time performance on COCO and Roboflow100-VL (RF100-VL), and its 2x-large variant surpasses 60 AP on COCO. The work highlights the value of dataset- and hardware-aware NAS for robust transfer to diverse domains, shows the benefits of architecture augmentation and pre-training, and calls for broader, reproducible latency benchmarks across models.
Abstract
Open-vocabulary detectors achieve impressive performance on COCO, but often fail to generalize to real-world datasets with out-of-distribution classes not typically found in their pre-training. Rather than simply fine-tuning a heavyweight vision-language model (VLM) for new domains, we introduce RF-DETR, a lightweight specialist detection transformer that discovers accuracy-latency Pareto curves for any target dataset with weight-sharing neural architecture search (NAS). Our approach fine-tunes a pre-trained base network on a target dataset and evaluates thousands of network configurations with different accuracy-latency tradeoffs without retraining. Further, we revisit the "tunable knobs" for NAS to improve the transferability of DETRs to diverse target domains. Notably, RF-DETR significantly improves on prior state-of-the-art real-time methods on COCO and Roboflow100-VL. RF-DETR (nano) achieves 48.0 AP on COCO, beating D-FINE (nano) by 5.3 AP at similar latency, and RF-DETR (2x-large) outperforms GroundingDINO (tiny) by 1.2 AP on Roboflow100-VL while running 20x as fast. To the best of our knowledge, RF-DETR (2x-large) is the first real-time detector to surpass 60 AP on COCO. Our code is available at https://github.com/roboflow/rf-detr.
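To make the notion of an accuracy-latency Pareto curve concrete, the following is a minimal sketch of how a Pareto front could be extracted once each candidate configuration has a measured latency and AP. It is not taken from the RF-DETR codebase: the `Candidate` type, the `pareto_front` helper, the configuration names, and all numbers are hypothetical and chosen only for illustration.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """One evaluated sub-network (hypothetical record, not the RF-DETR API)."""
    name: str
    latency_ms: float
    ap: float


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep candidates that no other candidate beats on both latency and AP."""
    # Sort by latency ascending; for equal latency, put the higher AP first.
    ordered = sorted(candidates, key=lambda c: (c.latency_ms, -c.ap))
    front: List[Candidate] = []
    best_ap = float("-inf")
    for c in ordered:
        # A candidate is kept only if it is more accurate than every
        # candidate that is at least as fast.
        if c.ap > best_ap:
            front.append(c)
            best_ap = c.ap
    return front


# Illustrative, made-up measurements (not results from the paper).
candidates = [
    Candidate("cfg-a", 2.0, 40.0),
    Candidate("cfg-b", 3.0, 39.0),  # slower and less accurate than cfg-a
    Candidate("cfg-c", 4.0, 45.0),
    Candidate("cfg-d", 4.0, 44.0),  # same latency as cfg-c, lower AP
    Candidate("cfg-e", 6.0, 50.0),
]
front = pareto_front(candidates)
print([c.name for c in front])  # ['cfg-a', 'cfg-c', 'cfg-e']
```

Because weight sharing lets every candidate be evaluated without retraining, a filtering step of this kind is cheap relative to evaluation; the sort makes it O(n log n) in the number of candidates.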
