RTGen: Real-Time Generative Detection Transformer
Chi Ruan, Jiying Zhao, Wenhu Chen
TL;DR
RTGen tackles open-ended object detection by unifying detection and category-name generation in a single transformer framework. It introduces a Region-Language Decoder (RL-Decoder) that jointly refines visual queries and textual embeddings, and a DAG-based, non-autoregressive text head that generates category names in parallel, enabling real-time inference. Built on RT-DETR, RTGen reaches up to 131.3 FPS on T4 GPUs while delivering competitive COCO AP and strong zero-shot generalization, all without relying on CLIP or pretrained language models. The work shows that open-ended detection can be made practical by tightly integrating cross-modal reasoning, although the generated vocabulary remains bounded by the category set of the training data, which may limit deployment in settings that need broader language coverage.
Abstract
Although open-vocabulary object detectors can generalize to unseen categories, they still rely on predefined textual prompts or classifier heads during inference. Recent generative object detectors address this limitation by coupling an autoregressive language model with a detector backbone, enabling direct category-name generation for each detected object. However, this straightforward design introduces structural redundancy and substantial latency. In this paper, we propose the Real-Time Generative Detection Transformer (RTGen), a generative object detector with a succinct encoder-decoder architecture. Specifically, we introduce a novel Region-Language Decoder (RL-Decoder) that jointly decodes visual and textual representations within a unified framework. The textual side is organized as a Directed Acyclic Graph (DAG), enabling non-autoregressive category naming. Benefiting from these designs, RTGen-R34 achieves 131.3 FPS on T4 GPUs, over 270× faster than GenerateU. Moreover, our models learn to generate category names directly from detection labels, without relying on external supervision such as CLIP or pretrained language models, achieving efficient and flexible open-ended detection.
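To make the idea of DAG-based, non-autoregressive naming concrete, the following is a minimal sketch of greedy path decoding over a small DAG. It is not RTGen's implementation: the abstract does not specify the decoding procedure, so the function name `greedy_dag_decode`, the `<eos>` token, the toy vocabulary, and all probabilities below are illustrative assumptions. The sketch only shows why such decoding avoids token-by-token generation: every vertex's token is predicted independently, and the category name is read off by following the most likely forward edges.

```python
from typing import List, Sequence


def greedy_dag_decode(
    token_probs: Sequence[Sequence[float]],
    trans_probs: Sequence[Sequence[float]],
    vocab: Sequence[str],
    eos: str = "<eos>",
) -> List[str]:
    """Greedy path decoding over a DAG (illustrative, not the paper's method).

    token_probs[v][t]: probability of vocabulary item t at vertex v.
    trans_probs[v][u]: probability of moving from vertex v to vertex u (u > v).
    """
    num_vertices = len(token_probs)

    # Each vertex picks its token independently of the others, so a real model
    # could compute this argmax for all vertices in one batched operation.
    best_token = [
        max(range(len(vocab)), key=lambda t: row[t]) for row in token_probs
    ]

    words: List[str] = []
    v = 0  # decoding always starts at the first vertex
    while True:
        token = vocab[best_token[v]]
        if token == eos:
            break
        words.append(token)
        if v == num_vertices - 1:
            break
        # Only forward edges (u > v) are considered, which keeps the graph acyclic.
        v = max(range(v + 1, num_vertices), key=lambda u: trans_probs[v][u])
    return words


# Toy example with made-up probabilities (assumption, for illustration only).
vocab = ["<eos>", "traffic", "light", "dog"]
token_probs = [
    [0.05, 0.80, 0.05, 0.10],  # vertex 0 prefers "traffic"
    [0.10, 0.10, 0.10, 0.70],  # vertex 1 prefers "dog" (skipped by the path)
    [0.05, 0.05, 0.85, 0.05],  # vertex 2 prefers "light"
    [0.90, 0.04, 0.03, 0.03],  # vertex 3 prefers "<eos>"
]
trans_probs = [
    [0.0, 0.2, 0.7, 0.1],  # from vertex 0, the most likely successor is vertex 2
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
]

name = greedy_dag_decode(token_probs, trans_probs, vocab)
print(" ".join(name))  # prints "traffic light"
```

In this sketch the expensive part (predicting tokens and edge scores) happens once for all vertices, and the path walk is a cheap lookup, in contrast to an autoregressive decoder that must run the model once per generated token.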
