Unifying Foundation Models with Quadrotor Control for Visual Tracking Beyond Object Categories
Alessandro Saviolo, Pratyaksh Rao, Vivek Radhakrishnan, Jiuhong Xiao, Giuseppe Loianno
TL;DR
This work addresses quadrotor visual tracking under diverse real-world conditions by unifying perception and control through three components: a detector built on a foundation model, a multi-layer tracker, and a model-free visual controller. The detector builds on the foundation model to provide target-agnostic, real-time detection. The tracker fuses spatial, temporal, and appearance cues, using an extended Kalman filter (EKF) and memory-based appearance cues, to maintain target visibility. The visual controller computes control signals on board from only a monocular camera and an IMU: it centers the target in the image while reducing the distance to it, using pitch-aware setpoints and an attitude PID that commands the motors. Extensive indoor and outdoor experiments show generalization to unseen categories, resilience to occlusions and disruptions, and reliable navigation in narrow and cluttered environments, which supports practical use on resource-constrained platforms.
Abstract
Visual control enables quadrotors to adaptively navigate using real-time sensory data, bridging perception with action. Yet challenges persist, including generalization across scenarios, maintaining reliability, and ensuring real-time responsiveness. This paper introduces a perception framework grounded in foundation models for universal object detection and tracking, moving beyond specific training categories. Integral to our approach is a multi-layered tracker integrated with the foundation detector, ensuring continuous target visibility even under motion blur, abrupt light shifts, and occlusions. Complementing this, we introduce a model-free controller tailored for resilient quadrotor visual tracking. Our system operates efficiently on limited hardware, relying solely on an onboard camera and an inertial measurement unit. Through extensive validation in diverse, challenging indoor and outdoor environments, we demonstrate our system's effectiveness and adaptability. Our research represents a step forward in quadrotor visual tracking, moving from task-specific methods to more versatile and adaptable operations.
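To make the controller idea in the summary concrete, the following is a minimal sketch of image-based tracking control: it turns a bounding box into normalized centering and distance errors and feeds one of them to a clamped PID. It is not the authors' implementation. The names `PID` and `image_errors`, the gains, the use of bounding-box area as a distance proxy, and the mapping from errors to commands are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PID:
    """Textbook PID with a clamped output (gains here are illustrative)."""
    kp: float
    ki: float
    kd: float
    limit: float
    integral: float = 0.0
    prev_error: float = 0.0

    def step(self, error: float, dt: float) -> float:
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        out = self.kp * error + self.ki * self.integral + self.kd * derivative
        # Saturate the command to a symmetric range [-limit, limit].
        return max(-self.limit, min(self.limit, out))


def image_errors(box, width, height, target_area_frac):
    """Normalized errors from a box (x_min, y_min, x_max, y_max) in pixels.

    Assumption: the fraction of the image covered by the box stands in
    for distance, since a monocular camera gives no direct range.
    """
    x_min, y_min, x_max, y_max = box
    cx = 0.5 * (x_min + x_max)
    cy = 0.5 * (y_min + y_max)
    ex = (cx - width / 2) / (width / 2)    # > 0: target right of center
    ey = (cy - height / 2) / (height / 2)  # > 0: target below center
    area_frac = (x_max - x_min) * (y_max - y_min) / (width * height)
    ed = target_area_frac - area_frac      # > 0: target appears too small
    return ex, ey, ed


# Hypothetical usage: horizontal error drives a yaw-rate command.
yaw_pid = PID(kp=1.0, ki=0.0, kd=0.0, limit=0.5)
ex, ey, ed = image_errors((500, 190, 600, 290), 640, 480, target_area_frac=0.1)
yaw_rate_cmd = yaw_pid.step(ex, dt=0.02)
```

In a sketch like this, `ey` and `ed` would feed similar loops for vertical motion and forward pitch. The pitch-aware setpoints mentioned in the summary are not modeled here.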
