RF-DETR: Neural Architecture Search for Real-Time Detection Transformers

Isaac Robinson, Peter Robicheaux, Matvei Popov, Deva Ramanan, Neehar Peri

TL;DR

Open-vocabulary detectors often struggle to generalize to real-world, out-of-distribution classes, while real-time specialist detectors lag in accuracy. RF-DETR combines internet-scale foundation-model priors with end-to-end weight-sharing neural architecture search to automatically discover Pareto-optimal accuracy-latency configurations on a given target dataset without retraining sub-nets. Key contributions include a scheduler-free NAS pipeline over a multi-knob search space (patch size, decoder layers, queries, image resolution, and windowed attention), a DINOv2-based backbone with a lightweight segmentation head (RF-DETR-Seg), and a standardized latency benchmarking protocol; RF-DETR demonstrates state-of-the-art real-time performance on COCO and RF100-VL, including surpassing 60 AP on COCO with the 2x-large variant. The work highlights the value of dataset- and hardware-aware NAS for robust transfer to diverse domains, while also showing the benefits of architecture augmentation and pre-training, and calling for broader, reproducible latency benchmarks across models.

Abstract

Open-vocabulary detectors achieve impressive performance on COCO, but often fail to generalize to real-world datasets with out-of-distribution classes not typically found in their pre-training. Rather than simply fine-tuning a heavy-weight vision-language model (VLM) for new domains, we introduce RF-DETR, a light-weight specialist detection transformer that discovers accuracy-latency Pareto curves for any target dataset with weight-sharing neural architecture search (NAS). Our approach fine-tunes a pre-trained base network on a target dataset and evaluates thousands of network configurations with different accuracy-latency tradeoffs without re-training. Further, we revisit the "tunable knobs" for NAS to improve the transferability of DETRs to diverse target domains. Notably, RF-DETR significantly improves on prior state-of-the-art real-time methods on COCO and Roboflow100-VL. RF-DETR (nano) achieves 48.0 AP on COCO, beating D-FINE (nano) by 5.3 AP at similar latency, and RF-DETR (2x-large) outperforms GroundingDINO (tiny) by 1.2 AP on Roboflow100-VL while running 20x as fast. To the best of our knowledge, RF-DETR (2x-large) is the first real-time detector to surpass 60 AP on COCO. Our code is at https://github.com/roboflow/rf-detr
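To make the notion of an accuracy-latency Pareto curve concrete, here is a minimal sketch of how such a frontier could be extracted once each candidate configuration has a measured latency and accuracy. The function name `pareto_frontier` and the numbers in `candidates` are illustrative assumptions, not values or code from the paper; the paper's own search procedure may differ.

```python
from typing import List, Tuple

def pareto_frontier(points: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Return the (latency, accuracy) points that no other point dominates.

    A point is dominated if another point is at least as fast and at least as
    accurate, and strictly better in one of the two.
    """
    frontier = []
    best_acc = float("-inf")
    # Sort by latency ascending; on ties, the most accurate point comes first.
    for latency, acc in sorted(points, key=lambda p: (p[0], -p[1])):
        # Keep a point only if it beats every faster (or equally fast) point.
        if acc > best_acc:
            frontier.append((latency, acc))
            best_acc = acc
    return frontier

# Synthetic placeholder measurements (latency in ms, accuracy in AP).
# These are NOT results from the paper.
candidates = [(2.0, 40.0), (3.0, 39.0), (3.5, 45.0), (5.0, 44.0), (6.0, 50.0)]
frontier = pareto_frontier(candidates)
print(frontier)  # [(2.0, 40.0), (3.5, 45.0), (6.0, 50.0)]
```

Because weight sharing lets every sub-network be evaluated without re-training, a filter of this kind only needs one (latency, accuracy) pair per configuration.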

Paper Structure

This paper contains 14 sections, 6 figures, 17 tables.

Figures (6)

  • Figure 1: Accuracy-Latency Pareto Curve. We plot the Pareto accuracy-latency frontier for real-time detectors on the COCO detection val-set (top left, bottom left), COCO segmentation val-set (top right), and RF100-VL test-set (bottom right). Since RF100-VL contains 100 distinct datasets, we select target latencies for the N, S, M, L, XL, 2XL configurations, search for RF-DETR models with latencies within 10% of the target and report their average performance after fine-tuning to convergence. Importantly, all points along RF-DETR's continuous Pareto curves for COCO are derived from a single training run.
  • Figure 2: RF-DETR Architecture. RF-DETR uses a pre-trained ViT backbone to extract multi-scale features of the input image. We interleave windowed and non-windowed attention blocks to balance accuracy and latency. Notably, the deformable cross-attention layer and segmentation head both bilinearly interpolate the output of the projector, allowing for consistent spatial organization of features. Lastly, we apply detection and segmentation losses at all decoder layers to facilitate decoder dropout at inference.
  • Figure 3: NAS Search Space. We vary (a) patch size, (b) number of decoder layers, (c) number of queries, (d) image resolution, and (e) number of windows per attention block in our weight-sharing NAS. In addition to training thousands of network configurations in parallel, we find that this "architecture augmentation" serves as a regularizer and improves generalization.
  • Figure 4: Impact of Decoder Layers vs. Query Tokens. We evaluate the impact of inference-time query dropping for trading off accuracy and latency in RF-DETR (nano). Interestingly, we find that dropping the 100 lowest-confidence queries does not significantly reduce performance, but modestly improves latency for all decoder layers.
  • Figure 5: Ablating Fixed Architecture on RF100-VL. We evaluate the benefit of dataset-specific NAS by transferring the COCO-optimized RF-DETR architecture to RF100-VL. Although the fixed architecture was not tuned for RF100-VL, it still outperforms LW-DETR. Running NAS directly on RF100-VL further improves performance over the fixed architecture. Additional fine-tuning provides consistent gains across all model sizes, with particularly strong improvements for smaller models. This is consistent with our observations on COCO object detection.
  • ...and 1 more figures
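
The five knobs listed in the Figure 3 caption can be pictured as a small discrete search space from which one configuration is drawn per training step. The sketch below is a rough illustration under stated assumptions: the candidate values in `SEARCH_SPACE`, the uniform sampling rule, and the helper names `sample_config` and `num_configs` are all hypothetical and are not taken from the paper or its code.

```python
import random
from math import prod

# Hypothetical placeholder values for each knob; the paper's actual
# candidate values are not stated in this section.
SEARCH_SPACE = {
    "patch_size": [12, 16],
    "num_decoder_layers": [2, 3, 4],
    "num_queries": [100, 200, 300],
    "image_resolution": [384, 512, 640],
    "num_windows": [1, 2, 4],
}

def sample_config(rng: random.Random) -> dict:
    """Draw one sub-network configuration, choosing each knob independently.

    Uniform sampling is an assumption made for this sketch.
    """
    return {knob: rng.choice(choices) for knob, choices in SEARCH_SPACE.items()}

def num_configs() -> int:
    """Count the distinct configurations in the (placeholder) search space."""
    return prod(len(choices) for choices in SEARCH_SPACE.values())

rng = random.Random(0)        # seeded so the sketch is reproducible
config = sample_config(rng)   # one configuration, e.g. for one training step
total = num_configs()
print(config)
print(total)  # 162 for the placeholder values above
```

Even this toy space yields 162 combinations, which suggests how a handful of knobs with a few more values each could reach the thousands of configurations the caption mentions.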