Follow Anything: Open-set detection, tracking, and following in real-time
Alaa Maalouf, Ninad Jadhav, Krishna Murthy Jatavallabhula, Makram Chahine, Daniel M. Vogt, Robert J. Wood, Antonio Torralba, Daniela Rus
TL;DR
Follow Anything (FAn) tackles real-time open-set object following in robotics: it detects, tracks, and follows any object specified through a multimodal query. The approach fuses segmentation from SAM with per-pixel features from DINO and CLIP. Each object is represented by a region descriptor $v_j$, and its similarity to a query descriptor $v$ is measured with $\cos(v_j,v) = \frac{v_j^T v}{\left\lVert v_j\right\rVert \left\lVert v\right\rVert + \varepsilon}$, where $\varepsilon$ is a small constant that keeps the denominator nonzero. This enables open-vocabulary detection and multi-target labeling in video streams. Key contributions include an open-vocabulary detection pipeline, hardware-conscious speedups (coarse DINO-based detections, quantization, tracing, and FastSAM), a per-pixel feature extraction strategy, and a robust re-detection mechanism that stores cross-trajectory ViT descriptors to recover from tracking loss. Experiments on a quadrotor demonstrate real-time following at $6$ to $20$ FPS on a GPU with $6$ to $8$ GB of memory, effective occlusion handling, and zero-shot detection across diverse objects. The code is openly released for adoption and extension.
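The similarity measure above can be illustrated with a short NumPy sketch. This is a minimal example under stated assumptions, not the released FAn implementation: the function names `cosine_scores` and `best_match`, the value of `eps`, the toy descriptors, and the acceptance `threshold` are all hypothetical choices made for illustration.

```python
import numpy as np

def cosine_scores(region_descriptors, query, eps=1e-8):
    """Return cos(v_j, v) for every row v_j of region_descriptors.

    eps plays the role of epsilon in the formula; its value is an assumption.
    """
    region_descriptors = np.asarray(region_descriptors, dtype=float)
    query = np.asarray(query, dtype=float)
    dots = region_descriptors @ query                      # v_j^T v for each j
    norms = np.linalg.norm(region_descriptors, axis=1) * np.linalg.norm(query)
    return dots / (norms + eps)                            # eps avoids division by zero

def best_match(region_descriptors, query, threshold=0.5):
    """Pick the region most similar to the query, or None if below threshold.

    The threshold is a hypothetical acceptance rule, not a value from the paper.
    """
    scores = cosine_scores(region_descriptors, query)
    j = int(np.argmax(scores))
    score = float(scores[j])
    return (j if score >= threshold else None), score

# Toy 3-dimensional descriptors; real descriptors would come from DINO or CLIP features.
regions = np.array([
    [1.0, 0.0, 0.0],   # orthogonal to the query
    [0.6, 0.8, 0.0],   # partially aligned with the query
    [0.0, 0.0, 0.0],   # degenerate all-zero descriptor
])
query = np.array([0.0, 1.0, 0.0])

scores = cosine_scores(regions, query)
index, score = best_match(regions, query)
print(scores, index, score)
```

The all-zero row shows why $\varepsilon$ sits in the denominator: without it, that region would produce a division by zero instead of a score of zero.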
Abstract
Tracking and following objects of interest is critical to several robotics use cases, ranging from industrial automation to logistics and warehousing, to healthcare and security. In this paper, we present a robotic system to detect, track, and follow any object in real-time. Our approach, dubbed "follow anything" (FAn), is an open-vocabulary and multimodal model: it is not restricted to concepts seen at training time and can be applied to novel classes at inference time using text, images, or click queries. Leveraging rich visual descriptors from large-scale pre-trained models (foundation models), FAn can detect and segment objects by matching multimodal queries (text, images, clicks) against an input image sequence. These detected and segmented objects are tracked across image frames, all while accounting for occlusion and object re-emergence. We demonstrate FAn on a real-world robotic system (a micro aerial vehicle) and report its ability to seamlessly follow the objects of interest in a real-time control loop. FAn can be deployed on a laptop with a lightweight (6-8 GB) graphics card, achieving a throughput of 6-20 frames per second. To enable rapid adoption, deployment, and extensibility, we open-source all our code on our project webpage at https://github.com/alaamaalouf/FollowAnything . We also encourage the reader to watch our 5-minute explainer video at https://www.youtube.com/watch?v=6Mgt3EPytrw .
