T-Rex: Counting by Visual Prompting
Qing Jiang, Feng Li, Tianhe Ren, Shilong Liu, Zhaoyang Zeng, Kent Yu, Lei Zhang
TL;DR
T-Rex reframes object counting as open-set detection guided by visual prompts, enabling interactive, feedback-driven counting without predefined categories. It uses a lightweight prompt encoder and box decoder on top of a vision encoder to locate instances in a target image that match the prompted pattern, and it produces a count by thresholding the resulting detections. The authors introduce CA-44, a diverse counting benchmark, and report state-of-the-art performance on FSC147 and FSCD-LVIS along with strong zero-shot capabilities, complemented by interactive refinement and cross-image prompting. The work points to a practical, versatile counting paradigm that applies across domains and could be paired with segmentation tools for visualization.
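To make the "count by thresholding detections" step concrete, here is a minimal sketch. It is not the authors' implementation: the `Detection` type, the `count_objects` helper, and the default threshold of 0.3 are all hypothetical and chosen only for illustration. It assumes the detector has already produced scored boxes for the target image.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    """One candidate box from a detector (hypothetical type for this sketch)."""
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
    score: float                            # confidence in [0, 1]


def count_objects(
    detections: List[Detection],
    score_threshold: float = 0.3,  # illustrative value, not taken from the paper
) -> Tuple[int, List[Detection]]:
    """Keep detections whose score meets the threshold and return the count.

    The kept boxes are returned as well, so they can be drawn on the image
    as visual feedback for the user.
    """
    kept = [d for d in detections if d.score >= score_threshold]
    return len(kept), kept


# Usage: three candidate boxes, one of which falls below the threshold.
candidates = [
    Detection((0.0, 0.0, 10.0, 10.0), 0.90),
    Detection((5.0, 5.0, 15.0, 15.0), 0.20),
    Detection((20.0, 20.0, 30.0, 30.0), 0.30),
]
count, kept = count_objects(candidates, score_threshold=0.3)
print(count)
```

Because the count is derived from explicit boxes rather than a density map, every counted object can be shown to the user, which is what makes the interactive refinement described above possible.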
Abstract
We introduce T-Rex, an interactive object counting model designed to first detect and then count any objects. We formulate object counting as an open-set object detection task with the integration of visual prompts. Users can specify the objects of interest by marking points or boxes on a reference image, and T-Rex then detects all objects with a similar pattern. Guided by the visual feedback from T-Rex, users can also interactively refine the counting results by prompting on missing or falsely detected objects. T-Rex achieves state-of-the-art performance on several class-agnostic counting benchmarks. To further exploit its potential, we establish a new counting benchmark encompassing diverse scenarios and challenges. Both quantitative and qualitative results show that T-Rex possesses exceptional zero-shot counting capabilities. We also present various practical application scenarios for T-Rex, illustrating its potential in the realm of visual prompting.
