T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy

Qing Jiang, Feng Li, Zhaoyang Zeng, Tianhe Ren, Shilong Liu, Lei Zhang

TL;DR

T-Rex2 tackles open-set object detection by unifying text and visual prompts within a DETR-based framework. It introduces a region-level contrastive alignment through which text and visual prompts refine each other, enabling zero-shot detection across diverse benchmarks. The model supports four inference workflows (text prompts, interactive visual prompts, generic visual prompts, and a mixed mode) and is trained with data engines that combine labeled and pseudo-labeled data. The results show complementary strengths: text prompts excel on common categories, while visual prompts handle long-tailed and rare objects, a step toward generic object detection with practical interactive capabilities.

Abstract

We present T-Rex2, a highly practical model for open-set object detection. Previous open-set object detection methods relying on text prompts effectively encapsulate the abstract concept of common objects, but struggle with rare or complex object representation due to data scarcity and descriptive limitations. Conversely, visual prompts excel in depicting novel objects through concrete visual examples, but fall short in conveying the abstract concept of objects as effectively as text prompts. Recognizing the complementary strengths and weaknesses of both text and visual prompts, we introduce T-Rex2, which synergizes both prompts within a single model through contrastive learning. T-Rex2 accepts inputs in diverse formats, including text prompts, visual prompts, and the combination of both, so that it can handle different scenarios by switching between the two prompt modalities. Comprehensive experiments demonstrate that T-Rex2 exhibits remarkable zero-shot object detection capabilities across a wide spectrum of scenarios. We show that text prompts and visual prompts can benefit from each other within the synergy, which is essential to cover massive and complicated real-world scenarios and pave the way towards generic object detection. The model API is now available at https://github.com/IDEA-Research/T-Rex.
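
The contrastive alignment between the two prompt modalities can be illustrated with a small sketch. The code below is not the T-Rex2 implementation: it assumes a generic symmetric InfoNCE-style loss, a hypothetical temperature of 0.07, and toy embeddings in which `visual_embs[i]` and `text_embs[i]` describe the same category. The exact formulation used in the paper is not given in this section.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_row(logits, target):
    """Negative log-softmax of the target entry (numerically stable)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def contrastive_alignment_loss(visual_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE-style loss (an assumed, generic formulation).

    visual_embs[i] and text_embs[i] are treated as a positive pair;
    every other combination in the batch is treated as a negative.
    """
    v = [l2_normalize(e) for e in visual_embs]
    t = [l2_normalize(e) for e in text_embs]
    n = len(v)
    # Temperature-scaled cosine similarity between every visual/text pair.
    sim = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Visual-to-text direction: each row should peak on its diagonal entry.
    v2t = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    # Text-to-visual direction: each column should peak on its diagonal entry.
    t2v = sum(
        cross_entropy_row([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (v2t + t2v)


# Toy example: two categories with 2-D embeddings (illustrative values only).
visual = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[1.0, 0.0], [0.0, 1.0]]
swapped_text = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = contrastive_alignment_loss(visual, aligned_text)
swapped_loss = contrastive_alignment_loss(visual, swapped_text)
```

In this toy setup the loss is close to zero when each visual prompt embedding matches its own text embedding and grows large when the pairs are swapped, which is the behaviour a contrastive objective is meant to reward.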

Paper Structure

This paper contains 30 sections, 12 equations, 10 figures, and 10 tables.

Figures (10)

  • Figure 1: Workflow of the proposed image prompt data engine.
  • Figure 2: Long-tailed curve of object frequency and the number of categories that can be detected. We suggest that the text prompt can cover the middle part of the long-tailed curve, while the visual prompt can cover the tail.
  • Figure 2: Examples in DetSA-1B.
  • Figure 3: Overview of the T-Rex2 model. T-Rex2 mainly follows the design principles of DETR (Carion et al., 2020), an end-to-end object detection model. Visual prompts and text prompts are introduced through deformable cross-attention (Zhu et al., 2020) and the CLIP (Radford et al., 2021) text encoder, respectively, and are aligned through contrastive learning.
  • Figure 3: Visualization results of region classification workflow. We use a dictionary of 2560 classes to classify the visual prompts. The classification result is shown at the bottom right for each image.
  • ...and 5 more figures