Follow Anything: Open-set detection, tracking, and following in real-time

Alaa Maalouf; Ninad Jadhav; Krishna Murthy Jatavallabhula; Makram Chahine; Daniel M. Vogt; Robert J. Wood; Antonio Torralba; Daniela Rus

TL;DR

Follow Anything (FAn) tackles the problem of real-time open-set object following in robotics by enabling detection, tracking, and following of any object defined through multimodal queries. The approach fuses segmentation from SAM with per-pixel features from DINO and CLIP, representing objects via region descriptors $v_j$ and measuring similarity to a query descriptor $v$ with $\cos(v_j,v) = \frac{v_j^T v}{\left\lVert v_j\right\rVert \left\lVert v\right\rVert + \varepsilon}$, enabling open-vocabulary detection and multi-target labeling in video streams. Key contributions include an open-vocabulary detection pipeline, hardware-conscious speedups (coarse DINO-based detections, quantization, tracing, and FastSAM), a per-pixel feature extraction strategy, and a robust re-detection mechanism that stores cross-trajectory ViT descriptors to recover from tracking loss. Experiments on a quadrotor demonstrate real-time following at $6$ to $20$ FPS on an $8$ GB GPU, effective occlusion handling, and zero-shot detection across diverse objects, with the code openly released for adoption and extension.
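
The matching step in the summary above can be illustrated with a minimal sketch. It is not the released FAn code: the mean pooling used to build each region descriptor $v_j$, the value of $\varepsilon$, and the helper names (`cosine`, `region_descriptor`, `label_regions`) are assumptions made for illustration. Per-pixel features are assumed to arrive as an `(H, W, D)` array and masks as boolean `(H, W)` arrays.

```python
import numpy as np

EPS = 1e-8  # assumed value; the summary only states that an epsilon is added


def cosine(v_j, v, eps=EPS):
    """cos(v_j, v) = v_j^T v / (||v_j|| ||v|| + eps)."""
    return float(v_j @ v) / (np.linalg.norm(v_j) * np.linalg.norm(v) + eps)


def region_descriptor(features, mask):
    """Average the per-pixel features (H, W, D) inside a boolean mask (H, W).

    Mean pooling is an assumption made for this sketch.
    """
    return features[mask].mean(axis=0)


def label_regions(features, masks, queries):
    """Give each mask the name of the query with the highest cosine similarity."""
    labels = []
    for mask in masks:
        v_j = region_descriptor(features, mask)
        scores = {name: cosine(v_j, v) for name, v in queries.items()}
        labels.append(max(scores, key=scores.get))
    return labels
```

For example, with two query descriptors named `"whale"` and `"water"`, every segmentation mask receives whichever of the two names scores higher, which corresponds to the multi-target labeling described in the summary.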

Abstract

Tracking and following objects of interest is critical to several robotics use cases, ranging from industrial automation to logistics and warehousing, to healthcare and security. In this paper, we present a robotic system to detect, track, and follow any object in real-time. Our approach, dubbed "follow anything" (FAn), is an open-vocabulary and multimodal model: it is not restricted to concepts seen at training time and can be applied to novel classes at inference time using text, images, or click queries. Leveraging rich visual descriptors from large-scale pre-trained models (foundation models), FAn can detect and segment objects by matching multimodal queries (text, images, clicks) against an input image sequence. These detected and segmented objects are tracked across image frames, all while accounting for occlusion and object re-emergence. We demonstrate FAn on a real-world robotic system (a micro aerial vehicle) and report its ability to seamlessly follow the objects of interest in a real-time control loop. FAn can be deployed on a laptop with a lightweight (6-8 GB) graphics card, achieving a throughput of 6-20 frames per second. To enable rapid adoption, deployment, and extensibility, we open-source all our code on our project webpage at https://github.com/alaamaalouf/FollowAnything . We also encourage the reader to watch our 5-minute explainer video at https://www.youtube.com/watch?v=6Mgt3EPytrw .

Paper Structure (13 sections, 6 equations, 11 figures, 5 tables)

Introduction
Our Contributions
Our Approach: FAn
Real-time open-vocabulary object detection
Fast detection for limited hardware
Extracting per-pixel feature descriptors
Re-detecting a lost object
Experiments
Implementation and system details
Real-time object following experiments
Zero-shot detection experiments
Mask quality experiments
Discussion and conclusions

Figures (11)

Figure 1: Follow anything (FAn) is a real-time robotic system to detect, track, and follow objects in an open-vocabulary setting. Objects of interest may be specified using text descriptions, images, or clicks. FAn then leverages foundation models like CLIP [radford2021learning], DINO [caron2021emerging], and SAM [kirillov2023segment] to compute segmentation masks and bounding boxes that best align with the queried objects. These objects are tracked across video frames, accounting for occlusion and object re-emergence, which enables real-time following of objects of interest by a robot platform. On the right, we show FAn being deployed on a micro aerial vehicle (MAV), following an object of interest (here, another smaller drone). FAn can be deployed on a commodity laptop with a lightweight (6-8 GB) GPU, achieving real-time (6-20 fps) throughput. We encourage the reader to watch our explainer video at https://www.youtube.com/watch?v=6Mgt3EPytrw and to view the demos on our project webpage: https://github.com/alaamaalouf/FollowAnything.
Figure 2: Illustration of FAn outputs on an input frame of $4$ whales, with one click query on a whale and one click query on water. First, SAM extracts multiple masks; then, based on DINO features, FAn assigns each mask to the query (water or whale) it matches best. Finally, whales are detected by selecting the masks whose DINO feature descriptor is closest to the whale query descriptor. Note: heat maps are shown in the click (query) figures.
Figure 3: Automatic detection experiments (SAM-and-DINO). Examples of our automatic detection scheme for detecting Drones, Bricks, and RC Cars. The examples include (from left to right): the original input frame, the outputs of SAM segmentation masks, and DINO+Cosine similarity semantic segmentation and detection.
Figure 4: Heat maps showing the pixels' semantic similarity. For every pixel, its feature descriptor is extracted, and the cosine similarity is computed between this descriptor and that of a focal pixel (indicated by a yellow arrow).
Figure 5: Automatic re-detection via cross-trajectory stored ViT features. (1) At every frame, we store the DINO features representing the tracked object. (2) Once the object is lost, we (3) either apply a segmentation model or get suggested masks from the tracker; for every mask, we compute its DINO descriptor and (4) compare it to the stored ones. If a high similarity is obtained, we resume tracking the object; otherwise, we repeat (3) on the next frame.
...and 6 more figures
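
The re-detection loop described in the Figure 5 caption can be sketched roughly as follows. This is a hedged illustration and not the authors' implementation: the class name `DescriptorBank`, the default similarity threshold of `0.8`, and the choice to score each candidate by its maximum similarity over all stored descriptors are assumptions. Descriptors are assumed to be plain vectors that were already computed for each candidate mask.

```python
import numpy as np


class DescriptorBank:
    """Stores descriptors of the tracked object and matches candidates against them."""

    def __init__(self, threshold=0.8, eps=1e-8):
        self.stored = []            # descriptors collected while tracking (step 1)
        self.threshold = threshold  # assumed acceptance threshold
        self.eps = eps

    def store(self, descriptor):
        """Step 1: remember the descriptor of the tracked object for this frame."""
        self.stored.append(np.asarray(descriptor, dtype=float))

    def _cos(self, a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + self.eps)

    def best_match(self, candidates):
        """Steps 3-4: compare candidate mask descriptors with the stored ones.

        Returns (index, score) of the best candidate, or (None, score) when no
        candidate reaches the threshold, so the caller can retry on the next frame.
        """
        best_idx, best_score = None, -1.0
        for i, cand in enumerate(candidates):
            cand = np.asarray(cand, dtype=float)
            # Max over stored descriptors is an assumption of this sketch.
            score = max((self._cos(cand, s) for s in self.stored), default=-1.0)
            if score > best_score:
                best_idx, best_score = i, score
        if best_score >= self.threshold:
            return best_idx, best_score
        return None, best_score
```

In use, `store` would be called on every frame while the tracker is confident, and `best_match` would be called on the candidate masks of each new frame after the object is lost, until it returns an index instead of `None`.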