T-Rex-Omni: Integrating Negative Visual Prompt in Generic Object Detection

Jiazhou Zhou, Qing Jiang, Kanghao Chen, Lutao Jiang, Yuanhuiyi Lyu, Ying-Cong Chen, Lei Zhang

TL;DR

T-Rex-Omni tackles open-set object detection by addressing the shortcomings of positive-only prompts, introducing negative visual prompts to actively suppress hard distractors. The framework features a unified positive-negative visual prompt encoder, a training-free Negating Negative Computing module, and a discriminative Negating Negative Hinge loss, enabling both immediate deployment and fine-tuning. In zero-shot evaluations across COCO, LVIS, ODinW, and Roboflow100, it achieves state-of-the-art results for visual prompts, with notable gains on long-tailed LVIS rare categories and a reduced gap to text-prompt methods. The approach supports three inference modes (user-curated, auto-suggested, and positive-only), offering practical flexibility and real-time viability for diverse applications while advancing robust open-set recognition by leveraging negative visual information.

Abstract

Object detection methods have evolved from closed-set to open-set paradigms. Current open-set object detectors, however, remain constrained by their exclusive reliance on positive indicators derived from prompts such as text descriptions or visual exemplars. This positive-only paradigm is consistently vulnerable to visually similar but semantically different distractors. We propose T-Rex-Omni, a novel framework that addresses this limitation by incorporating negative visual prompts to negate hard negative distractors. Specifically, we first introduce a unified visual prompt encoder that jointly processes positive and negative visual prompts. Next, we propose a training-free Negating Negative Computing (NNC) module that dynamically suppresses negative responses during the probability computation stage. To further boost performance through fine-tuning, our Negating Negative Hinge (NNH) loss enforces discriminative margins between positive and negative embeddings. T-Rex-Omni supports flexible deployment in both positive-only and joint positive-negative inference modes, accommodating either user-specified or automatically generated negative examples. Extensive experiments demonstrate strong zero-shot detection performance, significantly narrowing the gap between visual-prompted and text-prompted methods, with particular strength in long-tailed scenarios (51.2 AP_r on LVIS-minival). This work establishes negative prompts as a crucial new dimension for advancing open-set visual recognition systems.
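To make the two mechanisms named in the abstract more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's formulation: the function names `nnc_scores` and `nnh_loss`, the use of dot-product similarity with a max over prompts, and the roles given to `beta` (a suppression weight) and `eta` (a hinge margin) are all hypothetical. The paper's figure list names $\beta$ and $\eta$ as hyperparameters of NNC and NNH, but their exact definitions are not given in this excerpt.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def nnc_scores(queries, pos_prompts, neg_prompts=None, beta=0.3):
    """Training-free negative suppression at scoring time (hypothetical form).

    queries:     (Q, D) object query embeddings
    pos_prompts: (P, D) positive visual prompt embeddings
    neg_prompts: (N, D) negative visual prompt embeddings, or None
    beta:        assumed suppression weight
    """
    # Best-matching positive prompt for each query.
    pos_sim = (queries @ pos_prompts.T).max(axis=1)
    if neg_prompts is None or len(neg_prompts) == 0:
        # Positive-only mode: no suppression is applied.
        return sigmoid(pos_sim)
    # Best-matching negative prompt for each query.
    neg_sim = (queries @ neg_prompts.T).max(axis=1)
    # Clamp at zero so a dissimilar negative never raises the score.
    return sigmoid(pos_sim - beta * np.maximum(neg_sim, 0.0))


def nnh_loss(queries, pos_prompts, neg_prompts, eta=0.5):
    """Hinge loss with an assumed margin eta between positive and negative similarity.

    All queries are assumed to be matched to ground-truth targets.
    """
    pos_sim = (queries @ pos_prompts.T).max(axis=1)
    neg_sim = (queries @ neg_prompts.T).max(axis=1)
    return np.maximum(0.0, eta - (pos_sim - neg_sim)).mean()
```

In this sketch, omitting `neg_prompts` reproduces the positive-only mode, while passing them lowers the score of any query that resembles a negative exemplar without any retraining. The hinge term would only be used during fine-tuning.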

Paper Structure

This paper contains 23 sections, 9 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: T-Rex-Omni employs dual visual prompts to enhance detection precision: positive prompts (e.g., "Chihuahua") guide target object localization while negative prompts (e.g., "muffin") actively suppress visually similar distractors. The joint positive-negative framework enables more discriminative and user-specified object detection.
  • Figure 2: Overview of the T-Rex-Omni model.
  • Figure 3: Ablation study on hyperparameters. (a) $\beta$ in the NNC module; (b) $\eta$ in the NNH loss; (c) positive prompt quantity; (d) negative prompt quantity.
  • Figure 4: Visualization of T-Rex-Omni's three inference modes. (a) Positive-only; (b) Auto-suggested; (c) User-curated.
  • Figure 5: Visual Prompt Generation Process. (a) Randomly shift center; (b) Randomly shift size; (c) Randomly shift size and shift center.
  • ...and 6 more figures
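
The Figure 5 caption lists three ways a visual prompt box can be perturbed: shifting its center, shifting its size, or both. A minimal sketch of such box jitter is shown below. The function name `jitter_box`, the `(cx, cy, w, h)` box format, the uniform sampling, and the `max_shift` and `max_scale` ranges are assumptions for illustration only; the paper's actual sampling scheme is not described in this excerpt.

```python
import random


def jitter_box(box, shift_center=True, shift_size=True,
               max_shift=0.1, max_scale=0.2, rng=None):
    """Randomly perturb a (cx, cy, w, h) box.

    max_shift: assumed maximum center offset, as a fraction of width/height
    max_scale: assumed maximum relative change in width/height
    """
    if rng is None:
        rng = random.Random()
    cx, cy, w, h = box
    if shift_center:
        # Offsets are proportional to the original box dimensions.
        cx += rng.uniform(-max_shift, max_shift) * w
        cy += rng.uniform(-max_shift, max_shift) * h
    if shift_size:
        # Width and height are rescaled independently.
        w *= 1.0 + rng.uniform(-max_scale, max_scale)
        h *= 1.0 + rng.uniform(-max_scale, max_scale)
    return (cx, cy, w, h)
```

Calling `jitter_box(box, shift_size=False)` corresponds to panel (a), `jitter_box(box, shift_center=False)` to panel (b), and the default call to panel (c).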