Unifying Foundation Models with Quadrotor Control for Visual Tracking Beyond Object Categories

Alessandro Saviolo; Pratyaksh Rao; Vivek Radhakrishnan; Jiuhong Xiao; Giuseppe Loianno

TL;DR

This work addresses quadrotor visual tracking under diverse real-world conditions by unifying perception and control through three components: a foundation-model-driven detector, a multi-layered tracker, and a model-free visual controller. The detector provides target-agnostic, real-time detection. The tracker fuses spatial, temporal, and appearance cues, including an EKF and a memory-based appearance cue, to maintain target visibility. The visual controller derives its control signals on board from a monocular camera and an IMU alone: it centers the target in the image while closing the distance to it, using pitch-aware setpoints and an attitude PID to command the motors. Extensive indoor and outdoor experiments show generalization to unseen categories, resilience to occlusions and disruptions, and reliable navigation in narrow and cluttered environments, all on resource-constrained platforms.
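As a rough illustration of the "center the target while closing the distance" idea, the sketch below maps a tracked bounding box to three setpoints with plain proportional gains. It is not the paper's controller: the function name `follow_setpoints`, the gains, the limits, the use of box area as a distance proxy, and the sign conventions are all assumptions made for this example, and the pitch-aware compensation and attitude PID mentioned above are not modeled.

```python
def clamp(value, limit):
    """Restrict value to the interval [-limit, limit]."""
    return max(-limit, min(limit, value))


def follow_setpoints(box, width, height, desired_area=0.2,
                     k_yaw=1.5, k_climb=1.0, k_pitch=1.0,
                     max_rate=1.0, max_pitch=0.3):
    """Map a bounding box (x_min, y_min, x_max, y_max) in pixels to setpoints.

    All gains, limits, and sign conventions are hypothetical.
    """
    x_min, y_min, x_max, y_max = box

    # Offset of the box center from the image center, normalized to [-1, 1].
    ex = ((x_min + x_max) / 2 - width / 2) / (width / 2)
    ey = ((y_min + y_max) / 2 - height / 2) / (height / 2)

    # Fraction of the image covered by the box, used here as a crude
    # stand-in for distance: a small box suggests a far target.
    area = (x_max - x_min) * (y_max - y_min) / (width * height)

    return {
        # Target right of center -> positive yaw rate (assumed convention).
        "yaw_rate": clamp(k_yaw * ex, max_rate),
        # Image y grows downward, so a target below center -> descend.
        "climb_rate": clamp(-k_climb * ey, max_rate),
        # Box smaller than desired -> pitch forward to approach.
        "pitch": clamp(k_pitch * (desired_area - area), max_pitch),
    }


if __name__ == "__main__":
    # A small box centered in a 640x480 image: no turn, no climb, move forward.
    print(follow_setpoints((270, 190, 370, 290), 640, 480))
```

In a real system these setpoints would feed a lower-level attitude loop; the clamping here only keeps the illustrative outputs bounded.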

Abstract

Visual control enables quadrotors to adaptively navigate using real-time sensory data, bridging perception with action. Yet, challenges persist, including generalization across scenarios, maintaining reliability, and ensuring real-time responsiveness. This paper introduces a perception framework grounded in foundation models for universal object detection and tracking, moving beyond specific training categories. Integral to our approach is a multi-layered tracker integrated with the foundation detector, ensuring continuous target visibility, even when faced with motion blur, abrupt light shifts, and occlusions. Complementing this, we introduce a model-free controller tailored for resilient quadrotor visual tracking. Our system operates efficiently on limited hardware, relying solely on an onboard camera and an inertial measurement unit. Through extensive validation in diverse challenging indoor and outdoor environments, we demonstrate our system's effectiveness and adaptability. In conclusion, our research represents a step forward in quadrotor visual tracking, moving from task-specific methods to more versatile and adaptable operations.
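To make the role of temporal filtering during occlusions more concrete, here is a minimal sketch of a constant-velocity Kalman filter that tracks one image coordinate of the target and keeps predicting when detections are missing. The summary mentions an EKF; this example swaps in a linear filter for brevity, and the class name `ConstantVelocityFilter`, the noise values `q` and `r`, and the one-dimensional state are assumptions for illustration, not details taken from the paper.

```python
class ConstantVelocityFilter:
    """1-D Kalman filter with state (position, velocity). Hypothetical sketch."""

    def __init__(self, q=0.01, r=1.0, initial_variance=100.0):
        self.p = 0.0          # estimated position (e.g., box center x, in pixels)
        self.v = 0.0          # estimated velocity (pixels per frame)
        # Covariance matrix entries, stored as scalars.
        self.c00 = initial_variance
        self.c01 = 0.0
        self.c10 = 0.0
        self.c11 = initial_variance
        self.q = q            # process noise added to both diagonal entries
        self.r = r            # measurement noise variance

    def predict(self, dt=1.0):
        """Propagate the state one step; call this every frame."""
        self.p += self.v * dt
        c00 = self.c00 + dt * (self.c01 + self.c10) + dt * dt * self.c11 + self.q
        c01 = self.c01 + dt * self.c11
        c10 = self.c10 + dt * self.c11
        c11 = self.c11 + self.q
        self.c00, self.c01, self.c10, self.c11 = c00, c01, c10, c11

    def update(self, z):
        """Correct the state with a position measurement; skip when occluded."""
        s = self.c00 + self.r                 # innovation variance
        k0, k1 = self.c00 / s, self.c10 / s   # Kalman gains
        y = z - self.p                        # innovation
        self.p += k0 * y
        self.v += k1 * y
        c00 = (1 - k0) * self.c00
        c01 = (1 - k0) * self.c01
        c10 = self.c10 - k1 * self.c00
        c11 = self.c11 - k1 * self.c01
        self.c00, self.c01, self.c10, self.c11 = c00, c01, c10, c11


if __name__ == "__main__":
    kf = ConstantVelocityFilter()
    for frame in range(1, 51):       # target moves 2 pixels per frame
        kf.predict()
        kf.update(2.0 * frame)
    for _ in range(3):               # three frames without a detection
        kf.predict()
    print(kf.p, kf.v)
```

During the missed frames only `predict` runs, so the estimate coasts along the last estimated velocity; a detector match can later be gated against this prediction.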

Paper Structure (17 sections, 18 equations, 7 figures)

Introduction
Related Works
Methodology
Target-Agnostic Real-Time Detection
Multi-layered Tracking
Spatial Coherence
Temporal Consistency
Appearance Robustness with Memory
Visual Control
Experimental Results
Setup
Perception Generalization Performance
Tracking Resiliency Post-Disruption
Tracker Ablation Study
Flight Through Narrow Featureless Corridors
...and 2 more sections

Figures (7)

Figure 1: Our proposed framework for detecting (white), tracking (red), and following arbitrary targets using quadrotors. The user-prompted target is detected and tracked over time by combining a real-time foundation detector with our novel multi-layered tracker. The quadrotor is continually controlled to navigate toward the target while maintaining it in the robot's view.
Figure 2: Despite the YOLO baseline's extensive training on 80 categories, it struggles to recognize custom drones, irregular trash cans, and pool noodles. In contrast, our foundation detector showcases significant adaptability and robustness, accurately identifying these unique objects without prior specific training.
Figure 3: Our detection and tracking algorithm's resilience and accuracy. Top row: Indoor spatial-temporal tracking of a human against occlusions. Bottom row: Tracking of our custom-made drone, highlighting re-identification capabilities.
Figure 4: Demonstrating our system's ability to detect, track, and navigate toward an asymmetrically shaped trash can within a featureless narrow corridor, hence highlighting the efficacy of our visual control method.
Figure 5: The human target transitions from outdoor to indoor settings, showcasing our system's ability to adapt to rapid lighting shifts and maintain consistent tracking.
...and 2 more figures
