See, Point, Fly: A Learning-Free VLM Framework for Universal Unmanned Aerial Navigation

Chih Yao Hu, Yang-Sen Lin, Yuna Lee, Chih-Hai Su, Jie-Ying Lee, Shr-Ruei Tsai, Chin-Yang Lin, Kuan-Wen Chen, Tsung-Wei Ke, Yu-Lun Liu

TL;DR

This work tackles zero-shot UAV navigation with free-form language instructions by repurposing frozen vision-language models for spatial grounding. SPF grounds 2D waypoints in the image, then lifts them to 3D actions via camera geometry and an adaptive travel-distance controller, enabling a lightweight closed-loop UAV policy without any task-specific training. It achieves state-of-the-art performance on the DRLSim2024 simulator and strong real-world results on a DJI Tello, significantly outperforming prior zero-shot baselines across long-horizon, obstacle-rich, and dynamic scenarios. The method generalizes across multiple VLM backbones and maintains robust performance under varying conditions, though it faces challenges from VLM hallucinations and latency that motivate future refinements in grounding fidelity and responsiveness.

Abstract

We present See, Point, Fly (SPF), a training-free aerial vision-and-language navigation (AVLN) framework built atop vision-language models (VLMs). SPF is capable of navigating to any goal based on any type of free-form instructions in any kind of environment. In contrast to existing VLM-based approaches that treat action prediction as a text generation task, our key insight is to consider action prediction for AVLN as a 2D spatial grounding task. SPF harnesses VLMs to decompose vague language instructions into iterative annotation of 2D waypoints on the input image. Along with the predicted traveling distance, SPF transforms predicted 2D waypoints into 3D displacement vectors as action commands for UAVs. Moreover, SPF adaptively adjusts the traveling distance to facilitate more efficient navigation. Notably, SPF performs navigation in a closed-loop control manner, enabling UAVs to follow dynamic targets in dynamic environments. SPF sets a new state of the art on the DRL simulation benchmark, outperforming the previous best method by an absolute margin of 63%. In extensive real-world evaluations, SPF outperforms strong baselines by a large margin. We also conduct comprehensive ablation studies to highlight the effectiveness of our design choices. Lastly, SPF shows remarkable generalization to different VLMs. Project page: https://spf-web.pages.dev
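
The closed-loop behaviour described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the callables `get_frame`, `predict_waypoint`, and `execute` are hypothetical stand-ins for the camera, the frozen VLM, and the flight controller, and using `None` as the "task completed" signal is an assumption made only for this example.

```python
from typing import Callable, Optional, Tuple

# Hypothetical waypoint type: pixel (u, v) plus a travel-distance cue d.
Waypoint = Tuple[float, float, float]


def run_closed_loop(
    get_frame: Callable[[], object],
    predict_waypoint: Callable[[object], Optional[Waypoint]],
    execute: Callable[[Waypoint], None],
    max_steps: int = 50,
) -> int:
    """Re-query the model on a fresh frame before every move.

    Returns the number of actions that were executed.
    """
    for step in range(max_steps):
        frame = get_frame()                  # see: grab the current view
        waypoint = predict_waypoint(frame)   # point: ask the model for a target
        if waypoint is None:                 # assumed "task done" signal
            return step
        execute(waypoint)                    # fly: act, then observe again
    return max_steps
```

Because a new frame is fetched before each prediction, a target that moves between steps is picked up on the next iteration, which is what lets a loop of this shape follow dynamic targets.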

Paper Structure

This paper contains 20 sections, 7 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Zero-shot language-guided UAV control. (a) The UAV continually replans to keep pace with a moving person. (b) The UAV chains multiple goals across the hall. (c) The UAV locates the person on the ground and navigates around obstacles. Coloured 3D boxes mark successive camera viewpoints, revealing the UAV’s full flight trajectory over the reconstructed point cloud. All waypoints are generated directly by the vision-language model, with no task-specific training.
  • Figure 2: Pipeline overview. A camera frame and user instructions enter a frozen vision-language model, which returns a structured JSON with a 2D waypoint and any obstacle boxes. An Action-to-Control layer converts this output into low-level velocity commands (yaw, throttle, pitch) that steer the UAV. The loop repeats until the task is completed.
  • Figure 3: Control-geometry details of our VLM-driven flight loop. A frozen vision-language model first predicts a 2D waypoint $(u, v)$ and a discrete depth cue $d_\text{VLM}$. (a) A nonlinear scaling curve converts $d_\text{VLM}$ into an adaptive step size $d_\text{adj}$, letting the UAV take larger strides in open space and smaller ones near obstacles. (b) The triple $(u, v, d_\text{adj})$ is unprojected through the pinhole model to a 3D displacement vector $(S_x, S_y, S_z)$ in the UAV's body frame. (c) This vector is decomposed into control primitives: yaw $\Delta\theta=\tan^{-1}(S_x/S_y)$, pitch $\Delta \text{Pitch}=\sqrt{S_x^2+S_y^2}$, and throttle $\Delta \text{Throttle}=S_z$. These quantities are sent as timed velocity commands by the execution layer. The perception, planning, and control cycle repeats until the language instruction is fulfilled.
  • Figure 4: Qualitative comparison of flight trajectories in the simulator. The trajectory of our method is colored green, PIVOT [nasiriany2024pivot] blue, and TypeFly [chen2023typefly] purple. The absence of a colored path indicates that the baseline failed to issue any fly command. Full videos are included in the supplementary materials.
  • Figure 5: Qualitative comparison of flight trajectories in the real world. The trajectory of our method is compared with those of the baselines in real-world testing. The takeoff trajectory is colored green and the task trajectory magenta. Please refer to the supplementary materials for full videos.
  • ...and 2 more figures
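
The geometry in the Figure 3 caption can be sketched in a few lines of Python. The control decomposition follows the caption's formulas. The unprojection step is an assumption, because the caption names the pinhole model without giving its equations: this sketch normalizes the pixel to $[-1, 1]$ about the image centre, treats $d_\text{adj}$ as forward depth along the optical axis, and scales by the tangent of the half field of view. The function names, the `Displacement` type, and the field-of-view parameters are all hypothetical.

```python
import math
from typing import NamedTuple, Tuple


class Displacement(NamedTuple):
    sx: float  # right (+) / left (-)
    sy: float  # forward
    sz: float  # up (+) / down (-)


def unproject(u: float, v: float, d_adj: float,
              width: int, height: int,
              hfov_deg: float, vfov_deg: float) -> Displacement:
    """Lift a pixel waypoint to a body-frame displacement (assumed pinhole form)."""
    # Normalize to [-1, 1] with the origin at the image centre.
    nx = (u - width / 2) / (width / 2)
    ny = (height / 2 - v) / (height / 2)  # image rows grow downward
    sy = d_adj  # assumption: d_adj is depth along the optical axis
    sx = nx * d_adj * math.tan(math.radians(hfov_deg) / 2)
    sz = ny * d_adj * math.tan(math.radians(vfov_deg) / 2)
    return Displacement(sx, sy, sz)


def to_controls(s: Displacement) -> Tuple[float, float, float]:
    """Decompose a displacement into (yaw, pitch, throttle) as in Figure 3(c)."""
    # atan2 equals tan^-1(S_x / S_y) for S_y > 0 and avoids division by zero.
    yaw = math.atan2(s.sx, s.sy)
    pitch = math.hypot(s.sx, s.sy)  # horizontal travel distance
    throttle = s.sz                 # vertical travel distance
    return yaw, pitch, throttle
```

For example, a waypoint at the image centre yields zero yaw and zero throttle, so the whole step becomes forward motion; a waypoint at the right edge of a camera with a 90° horizontal field of view yields a yaw of about 45°.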