T-Rex-Omni: Integrating Negative Visual Prompt in Generic Object Detection
Jiazhou Zhou, Qing Jiang, Kanghao Chen, Lutao Jiang, Yuanhuiyi Lyu, Ying-Cong Chen, Lei Zhang
TL;DR
T-Rex-Omni tackles open-set object detection by addressing the shortcomings of positive-only prompts, introducing negative visual prompts to actively suppress hard distractors. The framework features a unified positive-negative visual prompt encoder, a training-free Negating Negative Computing module, and a discriminative Negating Negative Hinge loss, enabling both immediate deployment and fine-tuning. In zero-shot evaluations across COCO, LVIS, ODinW, and Roboflow100, it achieves state-of-the-art results for visual prompts, with notable gains on long-tailed LVIS rare categories and a reduced gap to text-prompt methods. The approach supports three inference modes (user-curated, auto-suggested, and positive-only), offering practical flexibility and real-time viability for diverse applications while advancing robust open-set recognition by leveraging negative visual information.
Abstract
Object detection methods have evolved from closed-set to open-set paradigms over the years. Current open-set object detectors, however, remain constrained by their exclusive reliance on positive indicators derived from prompts such as text descriptions or visual exemplars. This positive-only paradigm is consistently vulnerable to visually similar but semantically different distractors. We propose T-Rex-Omni, a novel framework that addresses this limitation by incorporating negative visual prompts to negate hard negative distractors. Specifically, we first introduce a unified visual prompt encoder that jointly processes positive and negative visual prompts. Next, we propose a training-free Negating Negative Computing (NNC) module that dynamically suppresses negative responses during the probability computation stage. To further boost performance through fine-tuning, our Negating Negative Hinge (NNH) loss enforces discriminative margins between positive and negative embeddings. T-Rex-Omni supports flexible deployment in both positive-only and joint positive-negative inference modes, accommodating either user-specified or automatically generated negative examples. Extensive experiments demonstrate strong zero-shot detection performance, significantly narrowing the gap between visual-prompted and text-prompted methods and showing particular strength in long-tailed scenarios (51.2 AP_r on LVIS-minival). This work establishes negative prompts as a crucial new dimension for advancing open-set visual recognition systems.
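The abstract does not state the exact formulas behind NNC or the NNH loss, so the following is only a minimal sketch of the general idea under assumed forms, not the authors' implementation. It assumes each prompt and each candidate region is represented by an embedding vector, that the score is a sigmoid of a scaled cosine similarity, that suppression subtracts the strongest negative similarity, and that the hinge term acts on cosine distance. The function names, the `scale` factor, and the `margin` value are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def suppressed_score(query, positive, negatives, scale=10.0):
    """Assumed, illustrative scoring rule (not the paper's NNC formula).

    The query's similarity to the positive prompt is reduced by its
    similarity to the closest negative prompt before the sigmoid.
    An empty `negatives` list falls back to positive-only scoring.
    """
    pos_sim = cosine(query, positive)
    neg_sim = max((cosine(query, n) for n in negatives), default=0.0)
    # Only suppress: a dissimilar negative must never raise the score.
    return sigmoid(scale * (pos_sim - max(neg_sim, 0.0)))


def hinge_margin_loss(positive, negatives, margin=0.5):
    """Assumed, illustrative hinge term (not the paper's NNH loss).

    Penalizes each negative embedding whose cosine distance to the
    positive embedding is smaller than `margin`.
    """
    if not negatives:
        return 0.0
    penalties = [max(0.0, margin - (1.0 - cosine(positive, n))) for n in negatives]
    return sum(penalties) / len(penalties)


if __name__ == "__main__":
    target = [1.0, 0.0]        # positive prompt embedding (toy values)
    distractor = [0.9, 0.1]    # negative prompt embedding close to the target
    region = [0.9, 0.1]        # candidate region that matches the distractor
    print(suppressed_score(region, target, []))            # positive-only mode
    print(suppressed_score(region, target, [distractor]))  # with a negative prompt
    print(hinge_margin_loss(target, [distractor]))
```

In this toy setup, a region that resembles the distractor scores lower once the negative prompt is supplied, and the hinge term is non-zero whenever a negative embedding sits within the margin of the positive one, which is the kind of separation that fine-tuning would push toward.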
