
OnFly: Onboard Zero-Shot Aerial Vision-Language Navigation toward Safety and Efficiency

Guiyong Zheng, Yueting Ban, Mingjie Zhang, Juepeng Zheng, Boyu Zhou

TL;DR

OnFly is a fully onboard, real-time framework for zero-shot aerial vision-language navigation (AVLN). Its shared-perception dual-agent architecture decouples high-frequency target generation from low-frequency progress monitoring, which stabilizes decision-making.

Abstract

Aerial vision-language navigation (AVLN) enables UAVs to follow natural-language instructions in complex 3D environments. However, existing zero-shot AVLN methods often suffer from unstable single-stream vision-language model (VLM) decision-making, unreliable long-horizon progress monitoring, and a trade-off between safety and efficiency. We propose OnFly, a fully onboard, real-time framework for zero-shot AVLN. OnFly adopts a shared-perception dual-agent architecture that decouples high-frequency target generation from low-frequency progress monitoring, thereby stabilizing decision-making. It further employs a hybrid keyframe-recent-frame memory that preserves global trajectory context while maintaining KV-cache prefix stability, enabling reliable long-horizon monitoring with termination and recovery signals. In addition, a semantic-geometric verifier refines VLM-predicted targets for instruction consistency and geometric safety using VLM features and depth cues, while a receding-horizon planner generates optimized collision-free trajectories under geometric safety constraints, improving both safety and efficiency. In simulation, OnFly improves task success from 26.4% to 67.8% over the strongest state-of-the-art baseline, and fully onboard real-world flights validate its feasibility for real-time deployment. The code will be released at https://github.com/Robotics-STAR-Lab/OnFly
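
The decoupling of high-frequency target generation from low-frequency progress monitoring can be pictured with a toy scheduler. This is only a minimal sketch, not the authors' implementation: the class name `DualRateScheduler`, the `monitor_every` ratio, and the string observations are all hypothetical, and the abstract does not state the actual rates of either agent.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DualRateScheduler:
    """Toy scheduler: a fast target agent and a slow monitor share one observation."""

    # Assumed ratio between the two rates; the real rates are not given in the abstract.
    monitor_every: int = 5
    target_calls: List[int] = field(default_factory=list)
    monitor_calls: List[int] = field(default_factory=list)

    def step(self, tick: int, observation: str) -> None:
        # Shared perception: both agents would consume the same observation.
        # High-frequency path: a target is generated on every tick.
        self.target_calls.append(tick)
        # Low-frequency path: progress is checked only every `monitor_every` ticks.
        if tick % self.monitor_every == 0:
            self.monitor_calls.append(tick)


def run(num_ticks: int, monitor_every: int = 5) -> DualRateScheduler:
    """Drive the scheduler for `num_ticks` ticks with placeholder observations."""
    sched = DualRateScheduler(monitor_every=monitor_every)
    for tick in range(num_ticks):
        sched.step(tick, observation=f"frame-{tick}")
    return sched
```

With `run(10, monitor_every=5)`, the target path fires on all ten ticks while the monitor fires only on ticks 0 and 5, so a slow monitoring call never has to sit inside the fast target loop.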

Paper Structure (21 sections, 1 equation, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Real-world example of onboard language-guided UAV navigation. The figure illustrates the UAV platform (left), the first-person onboard view (inset), and the executed flight trajectory (right).
  • Figure 2: System pipeline of OnFly. OnFly is a fully onboard system for AVLN with dual-agent decision-making and hybrid memory-based progress monitoring. The task manager splits a long instruction into subtasks and outputs the current one. The dual-agent module predicts candidate goals and monitors progress from onboard observations; its Hybrid Memory combines the initial frame, keyframes, and the latest frame to preserve global trajectory context and KV-cache prefix stability. Semantic-geometric verification and a safety-aware planner then generate an executable real-time trajectory.
  • Figure 3: Hybrid Memory construction. The monitoring memory is built from the initial frame, selected keyframes, and the latest frame. It maintains a de-duplicated keyframe pool, selects representative keyframes for global coverage, and serializes them in a prefix-stable order for KV-cache reuse.
  • Figure 4: Simulation environments for benchmarking. (a) Examples of the 10 high-fidelity Unreal Engine scenes, covering diverse indoor and outdoor settings. (b) Task composition of the benchmark, including object navigation, precise navigation, and long-range navigation.
  • Figure 5: Simulation visualizations. Each row shows a navigation episode, with the language instruction at the top and representative first-person UAV keyframes along the trajectory.
  • ...and 1 more figure
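
The Hybrid Memory layout described in the captions of Figures 2 and 3 (initial frame, selected keyframes, latest frame, serialized in a prefix-stable order) can be sketched as follows. This is an illustrative simplification under stated assumptions, not the paper's method: `HybridMemory`, `max_keyframes`, and `common_prefix_len` are hypothetical names, frames are plain string IDs, and keyframe selection is replaced by an append-only pool with a cap, which does not reproduce the representative-keyframe selection for global coverage mentioned in the caption.

```python
from typing import List, Optional


class HybridMemory:
    """Toy monitoring memory: initial frame + keyframes + latest frame."""

    def __init__(self, max_keyframes: int = 4) -> None:
        self.initial: Optional[str] = None
        self.keyframes: List[str] = []      # de-duplicated, append-only pool
        self.latest: Optional[str] = None
        self.max_keyframes = max_keyframes  # assumed cap, not from the paper

    def observe(self, frame_id: str, is_keyframe: bool = False) -> None:
        if self.initial is None:
            self.initial = frame_id
        self.latest = frame_id
        # Append-only insertion keeps earlier entries in place, so the
        # serialized sequence only ever grows at its tail.
        if (
            is_keyframe
            and frame_id != self.initial
            and frame_id not in self.keyframes
            and len(self.keyframes) < self.max_keyframes
        ):
            self.keyframes.append(frame_id)

    def serialize(self) -> List[str]:
        """Return frames in a fixed order: initial, keyframes, latest."""
        if self.initial is None:
            return []
        seq = [self.initial] + self.keyframes
        if self.latest not in seq:
            seq.append(self.latest)
        return seq


def common_prefix_len(a: List[str], b: List[str]) -> int:
    """Number of leading items two serializations share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n
```

In this sketch, two successive serializations differ only from the first changed position onward, so the shared leading frames (measured by `common_prefix_len`) are the part of the prompt whose cached keys and values a VLM runtime could in principle reuse.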