RTGen: Real-Time Generative Detection Transformer

Chi Ruan, Jiying Zhao, Wenhu Chen

TL;DR

RTGen tackles open-ended object detection without large vision–language pretraining by unifying detection and category-name generation in a single transformer framework. It introduces a Region-Language Decoder that jointly refines visual queries and textual embeddings, and a DAG-based, non-autoregressive text head that generates category names in parallel, enabling real-time inference. Built on RT-DETR, RTGen reaches up to 131.3 FPS on T4 GPUs while delivering competitive COCO AP and strong zero-shot generalization, all without CLIP or pretrained language models. The work makes open-ended detection practical by tightly integrating cross-modal reasoning, though the generated vocabulary remains bounded by the category names in the training dataset, which may limit use where a broader vocabulary is needed.

Abstract

Although open-vocabulary object detectors can generalize to unseen categories, they still rely on predefined textual prompts or classifier heads during inference. Recent generative object detectors address this limitation by coupling an autoregressive language model with a detector backbone, enabling direct category name generation for each detected object. However, this straightforward design introduces structural redundancy and substantial latency. In this paper, we propose a Real-Time Generative Detection Transformer (RTGen), a real-time generative object detector with a succinct encoder-decoder architecture. Specifically, we introduce a novel Region-Language Decoder (RL-Decoder) that jointly decodes visual and textual representations within a unified framework. The textual side is organized as a Directed Acyclic Graph (DAG), enabling non-autoregressive category naming. Benefiting from these designs, RTGen-R34 achieves 131.3 FPS on T4 GPUs, over 270x faster than GenerateU. Moreover, our models learn to generate category names directly from detection labels, without relying on external supervision such as CLIP or pretrained language models, achieving efficient and flexible open-ended detection.
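
To put the speed claim in per-image terms, the short calculation below converts the reported throughput into latency. It is only a back-of-the-envelope sketch: it assumes the "over 270x" figure is a ratio of FPS values measured under comparable conditions, so the GenerateU numbers it prints are implied bounds, not measurements reported in the abstract. The helper name `latency_ms` is ours.

```python
def latency_ms(fps: float) -> float:
    """Average per-image latency in milliseconds for a given throughput."""
    return 1000.0 / fps


rtgen_fps = 131.3  # RTGen-R34 on a T4 GPU, as stated in the abstract
speedup = 270.0    # "over 270x", treated as a lower bound on the FPS ratio (assumption)

rtgen_ms = latency_ms(rtgen_fps)
# If RTGen is at least 270x faster, the baseline runs at no more than this FPS
baseline_fps_bound = rtgen_fps / speedup
# ...which corresponds to at least this much time per image
baseline_ms_bound = latency_ms(baseline_fps_bound)

print(f"RTGen-R34: {rtgen_ms:.2f} ms per image")
print(f"Implied baseline: under {baseline_fps_bound:.3f} FPS, "
      f"over {baseline_ms_bound / 1000:.2f} s per image")
```

Under these assumptions, RTGen-R34 spends roughly 7.6 ms per image, while the implied baseline needs more than about 2 s.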

Paper Structure

This paper contains 17 sections, 4 equations, 5 figures, 4 tables, 1 algorithm.

Figures (5)

  • Figure 1: Comparison of different object detection paradigms. The open-vocabulary detector (a) relies on predefined textual prompts and text encoding; previous generative detectors (b) attach an autoregressive text generator to the detector; our real-time generative detector (c) unifies detection and text generation within a single framework, generating category names efficiently without predefined prompts or external language models.
  • Figure 2: Overall architecture of RTGen. RTGen builds upon RT-DETR by introducing a unified Region–Language Decoder (RL-Decoder) that jointly processes object queries and positional text embeddings. The refined queries and text features are sent to a detection head and the DAG Text Head, enabling efficient real-time open-ended detection.
  • Figure 3: Structure of the proposed DAG Text Head. It estimates token transition probabilities via the Link Prediction Module and constructs a directed acyclic graph for non-autoregressive text generation.
  • Figure 4: AP, AP50, and AP75 on the COCO validation set for RTGen-R50 trained on COCO, as the number of text tokens varies. Using 8 text tokens gives the best overall performance across all three metrics.
  • Figure 5: Visualization results from RTGen-R101, trained on Objects365 and evaluated zero-shot on COCO val.
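
The Figure 3 caption describes token transition probabilities that define a directed acyclic graph over text positions. The sketch below shows one generic way such a graph can be decoded greedily, in the style of directed-acyclic non-autoregressive decoders. It is not taken from the paper: the function name `greedy_dag_decode`, the `<eos>` token, the start-at-vertex-0 convention, and the toy probabilities are all illustrative assumptions, and RTGen's actual decoding rule may differ.

```python
from typing import List, Sequence


def greedy_dag_decode(
    token_probs: Sequence[Sequence[float]],
    link_probs: Sequence[Sequence[float]],
    vocab: Sequence[str],
    eos: str = "<eos>",
) -> List[str]:
    """Follow the locally most likely path through a token DAG.

    token_probs[i][k]: probability of vocab[k] at vertex i.
    link_probs[i][j]:  probability of moving from vertex i to vertex j.
                       Only entries with j > i are read, so the path
                       always moves forward and cannot cycle.
    """
    n = len(token_probs)
    words: List[str] = []
    i = 0  # assumption: decoding starts at the first vertex
    while True:
        # Emit the most likely token at the current vertex
        k = max(range(len(vocab)), key=lambda t: token_probs[i][t])
        if vocab[k] == eos:
            break
        words.append(vocab[k])
        if i == n - 1:
            break  # last vertex has no successors
        # Jump to the most likely later vertex, possibly skipping some
        i = max(range(i + 1, n), key=lambda j: link_probs[i][j])
    return words


# Toy example with 4 vertices; all numbers are made up for illustration
vocab = ["<eos>", "traffic", "light", "dog"]
token_probs = [
    [0.05, 0.80, 0.05, 0.10],  # vertex 0 prefers "traffic"
    [0.10, 0.10, 0.10, 0.70],  # vertex 1 prefers "dog" (skipped below)
    [0.05, 0.05, 0.85, 0.05],  # vertex 2 prefers "light"
    [0.90, 0.04, 0.03, 0.03],  # vertex 3 prefers "<eos>"
]
link_probs = [
    [0.0, 0.2, 0.7, 0.1],  # vertex 0 most likely links to vertex 2
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
]

name = " ".join(greedy_dag_decode(token_probs, link_probs, vocab))
print(name)  # traffic light
```

All token distributions and link probabilities come from one forward pass, so the only sequential work is walking a short path through the graph. This is what lets a DAG-style head avoid step-by-step autoregressive generation.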