T-Rex: Counting by Visual Prompting

Qing Jiang, Feng Li, Tianhe Ren, Shilong Liu, Zhaoyang Zeng, Kent Yu, Lei Zhang

TL;DR

T-Rex reframes object counting as open-set detection guided by visual prompts, enabling interactive, feedback-driven counting without predefined categories. It uses a lightweight prompt-encoder and box-decoder on top of a vision encoder to locate pattern-matching instances in a target image, producing counts through thresholded detections. The authors introduce CA-44, a diverse benchmarking suite, and demonstrate state-of-the-art performance on FSC147/FSCD-LVIS with strong zero-shot capabilities, complemented by interactive refinement and cross-image prompting. The work suggests a practical, versatile counting paradigm with broad applicability across domains and potential integration with segmentation tools for visualization.
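
The final step described above, turning detections into a count by thresholding, can be illustrated with a minimal sketch. It is not the authors' implementation: the `Detection` class, the `count_by_threshold` helper, the example scores, and the 0.5 threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical container for one predicted box and its confidence."""
    box: tuple  # (x_min, y_min, x_max, y_max) in pixels
    score: float  # confidence in [0, 1]


def count_by_threshold(detections, score_threshold=0.5):
    """Keep detections whose score meets the threshold; the count is how many remain.

    The default threshold of 0.5 is an assumption for this sketch,
    not a value taken from the paper.
    """
    kept = [d for d in detections if d.score >= score_threshold]
    return len(kept), kept


# Made-up detector output for a single target image.
detections = [
    Detection((10, 10, 30, 30), 0.92),
    Detection((40, 12, 60, 33), 0.81),
    Detection((75, 15, 95, 35), 0.34),  # low confidence, dropped at 0.5
    Detection((5, 50, 25, 70), 0.67),
]

count, kept = count_by_threshold(detections, score_threshold=0.5)
print(count)  # 3
```

Because the count is derived from explicit boxes rather than a density map, each counted instance can be inspected, which is what makes prompting on missed or falsely detected objects possible.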

Abstract

We introduce T-Rex, an interactive object counting model designed to first detect and then count any objects. We formulate object counting as an open-set object detection task with the integration of visual prompts. Users can specify the objects of interest by marking points or boxes on a reference image, and T-Rex then detects all objects with a similar pattern. Guided by the visual feedback from T-Rex, users can also interactively refine the counting results by prompting on missing or falsely-detected objects. T-Rex has achieved state-of-the-art performance on several class-agnostic counting benchmarks. To further exploit its potential, we established a new counting benchmark encompassing diverse scenarios and challenges. Both quantitative and qualitative results show that T-Rex possesses exceptional zero-shot counting capabilities. We also present various practical application scenarios for T-Rex, illustrating its potential in the realm of visual prompting.

Paper Structure

This paper contains 13 sections, 7 equations, 22 figures, 4 tables.

Figures (22)

  • Figure 1: We introduce an interactive object counting model, T-Rex. Given boxes or points specified on the reference image, T-Rex detects all instances in the target image that exhibit a pattern similar to the specified object; the detections are then summed to obtain the count. For better visualization, we use SAM Kirillov_2023_ICCV to generate masks prompted by the boxes that T-Rex detects.
  • Figure 2: T-Rex is an object counting model characterized by four features: detection-based, visually promptable, interactive, and open-set. Listed methods are: Grounding DINO liu2023grounding, GLIP li2022grounded, Semantic-SAM li2023semantic, SEEM zou2023segment, SAM Kirillov_2023_ICCV, UniPose yang2023unipose, MQ-Det xu2023multi, OWL-ViT minderer2022simple, DINOv li2023visual.
  • Figure 3: Overview of the T-Rex model. T-Rex is a detection-based model comprising an image encoder that extracts image features, a prompt encoder that encodes the visual prompts (points or boxes) provided by users, and a box decoder that outputs the detected boxes.
  • Figure 4: T-Rex offers three major interactive workflows, which are applicable to most scenarios in real-world applications.
  • Figure 5: An overview of the proposed CA-44 benchmark. CA-44 consists of 44 datasets across eight domains and mainly comprises images with small, densely packed objects.
  • ...and 17 more figures