Detect Anything in Real Time: From Single-Prompt Segmentation to Multi-Class Detection

Mehmet Kerem Turkcan

Abstract

Recent advances in vision-language modeling have produced promptable detection and segmentation systems that accept arbitrary natural language queries at inference time. Among these, SAM3 achieves state-of-the-art accuracy by combining a ViT-H/14 backbone with cross-modal transformer decoding and learned object queries. However, SAM3 processes a single text prompt per forward pass, so detecting N categories requires N independent executions, each dominated by the 439M-parameter backbone. We present Detect Anything in Real Time (DART), a training-free framework that converts SAM3 into a real-time multi-class detector by exploiting a structural invariant: the visual backbone is class-agnostic, producing image features independent of the text prompt. This allows the backbone computation to be shared across all classes, reducing its cost from O(N) to O(1). Combined with batched multi-class decoding, detection-only inference, and TensorRT FP16 deployment, these optimizations yield a 5.6x cumulative speedup at 3 classes, scaling to 25x at 80 classes, without modifying any model weights. On COCO val2017 (5,000 images, 80 classes), DART achieves 55.8 AP at 15.8 FPS (4 classes, 1008x1008) on a single RTX 4080, surpassing purpose-built open-vocabulary detectors trained on millions of box annotations. For extreme latency targets, adapter distillation with a frozen encoder-decoder achieves 38.7 AP with a 13.9 ms backbone. Code and models are available at https://github.com/mkturkcan/DART.
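The backbone-sharing idea in the abstract can be illustrated with a minimal sketch. Everything named here (`CountingBackbone`, `toy_decode`, `detect_per_prompt`, `detect_shared`, `sharing_speedup`) is a hypothetical stand-in and not the DART or SAM3 API; the toy backbone simply counts its forward passes so the O(N) versus O(1) difference is visible. The `sharing_speedup` helper is an idealized cost model that assumes the backbone takes a fixed fraction of one single-prompt run and that every other cost scales linearly with class count. It covers backbone sharing only, so it is not expected to reproduce the reported cumulative speedups, which also include batching, detection-only inference, and TensorRT deployment.

```python
from typing import List, Sequence, Tuple

Features = float                 # toy stand-in for backbone feature maps
Detection = Tuple[str, float]    # toy stand-in for one class's detections


class CountingBackbone:
    """Hypothetical class-agnostic backbone that counts its forward passes."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, image: Sequence[float]) -> Features:
        self.calls += 1
        return float(sum(image))  # depends on the image only, never on the prompt


def toy_decode(features: Features, prompts: Sequence[str]) -> List[Detection]:
    """Hypothetical batched decoder: returns one result per text prompt."""
    return [(p, features + len(p)) for p in prompts]


def detect_per_prompt(image, prompts, backbone, decode) -> List[Detection]:
    """Baseline: rerun the whole pipeline per prompt (N backbone passes)."""
    return [decode(backbone(image), [p])[0] for p in prompts]


def detect_shared(image, prompts, backbone, decode) -> List[Detection]:
    """Shared variant: one backbone pass, then one batched decode."""
    features = backbone(image)
    return decode(features, list(prompts))


def sharing_speedup(n_classes: int, backbone_fraction: float) -> float:
    """Idealized speedup from backbone sharing alone (assumed linear cost model).

    Baseline cost is n_classes full runs; the shared cost is one backbone pass
    plus n_classes copies of the remaining per-class work.
    """
    per_class_rest = 1.0 - backbone_fraction
    return n_classes / (backbone_fraction + n_classes * per_class_rest)


if __name__ == "__main__":
    image = [0.1, 0.2, 0.3]
    prompts = ["person", "car", "dog"]
    baseline_bb, shared_bb = CountingBackbone(), CountingBackbone()
    same = (detect_per_prompt(image, prompts, baseline_bb, toy_decode)
            == detect_shared(image, prompts, shared_bb, toy_decode))
    print(same, baseline_bb.calls, shared_bb.calls)
```

Both paths return the same detections in this toy setting because the features never depend on the prompt, which is the structural invariant the abstract describes.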

Paper Structure

This paper contains 31 sections, 1 equation, 4 figures, 6 tables, 1 algorithm.

Figures (4)

  • Figure 1: From SAM3 to DART. (a) SAM3 runs the full pipeline once per class. The backbone accounts for 78% of compute and is repeated $N$ times despite being class-independent. (b) DART shares backbone features across all classes, batches the encoder-decoder, removes the mask head, and deploys both stages as TensorRT FP16 engines. The backbone uses restructured attention (explicit operations, real-valued RoPE) to enable correct FP16 export. (c) For video, backbone and encoder-decoder run on separate CUDA streams: the encoder-decoder for frame $t$ overlaps with the backbone for frame $t{+}1$, reducing per-frame latency from 70 ms to 53 ms (4 classes, 1008px).
  • Figure 2: FPS vs. class count. At 644px, all tested class counts exceed 30 FPS pipelined. At 1008px, up to 4 classes reach $\geq$15 FPS. Pipelining provides 8-15% improvement with diminishing returns as encoder-decoder cost approaches backbone cost.
  • Figure 3: Qualitative results. DART detections on COCO val2017 images using 6 open-vocabulary classes (person, car, dog, bicycle, chair, cat) at 1008px resolution with confidence threshold 0.45. All optimizations are structural: DART produces identical outputs to the per-class SAM3 baseline.
  • Figure 4: Speed-quality Pareto front. Numbers indicate class count. Teacher ViT-H dominates quality (55.8 AP) but is limited to $<$19 FPS. Distilled students reach 45--51 FPS at 4 classes, trading AP for 2.5--3$\times$ throughput. Among student backbones, RepViT-M2.3 achieves the highest AP (38.7) at 45 FPS.
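
The two-stream overlap described in the Figure 1(c) caption can be sketched as a simple scheduling simulation. This is an idealized model, not the authors' implementation: `simulate_latency` is a hypothetical helper, and the 53 ms / 17 ms stage split used in the example is an assumption chosen only so that the sequential and pipelined totals match the 70 ms and 53 ms quoted in the caption. The per-stage timings themselves are not given in this section, and real CUDA streams share GPU resources, so measured gains can be smaller than this model suggests.

```python
def simulate_latency(n_frames: int, backbone_ms: float, encdec_ms: float,
                     pipelined: bool) -> float:
    """Average per-frame time for a two-stage video pipeline (idealized).

    Sequential: the backbone for frame t+1 waits for the encoder-decoder
    of frame t. Pipelined: the two stages run on independent streams, so
    the backbone for frame t+1 overlaps the encoder-decoder of frame t.
    """
    backbone_free = 0.0   # time at which the backbone stream is next idle
    encdec_free = 0.0     # time at which the encoder-decoder stream is next idle
    for _ in range(n_frames):
        if pipelined:
            b_start = backbone_free
        else:
            b_start = max(backbone_free, encdec_free)
        b_end = b_start + backbone_ms
        # The encoder-decoder needs this frame's features and a free stream.
        e_start = max(b_end, encdec_free)
        e_end = e_start + encdec_ms
        backbone_free, encdec_free = b_end, e_end
    return encdec_free / n_frames


if __name__ == "__main__":
    # Assumed stage times (ms); illustrative only.
    backbone_ms, encdec_ms = 53.0, 17.0
    print(simulate_latency(1000, backbone_ms, encdec_ms, pipelined=False))
    print(simulate_latency(1000, backbone_ms, encdec_ms, pipelined=True))
```

In this model the steady-state per-frame time falls from the sum of the two stage times to the slower stage's time. That also explains why the benefit shrinks as the encoder-decoder cost approaches the backbone cost: the slower stage then accounts for a larger share of the sequential total.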