Fly0: Decoupling Semantic Grounding from Geometric Planning for Zero-Shot Aerial Navigation

Zhenxing Xu, Brikit Lu, Weidong Bao, Zhengqiu Zhu, Junsong Zhang, Hui Yan, Wenhao Lu, Ji Wang

TL;DR

Fly0 is a framework that decouples semantic reasoning from geometric planning. It outperforms state-of-the-art baselines while reducing computational overhead and improving system stability.

Abstract

Current Visual-Language Navigation (VLN) methodologies face a trade-off between semantic understanding and control precision. While Multimodal Large Language Models (MLLMs) offer superior reasoning, deploying them as low-level controllers leads to high latency, trajectory oscillations, and poor generalization due to weak geometric grounding. To address these limitations, we propose Fly0, a framework that decouples semantic reasoning from geometric planning. The proposed method operates through a three-stage pipeline: (1) an MLLM-driven module for grounding natural language instructions into 2D pixel coordinates; (2) a geometric projection module that utilizes depth data to localize targets in 3D space; and (3) a geometric planner that generates collision-free trajectories. This mechanism enables robust navigation even when visual contact is lost. By eliminating the need for continuous inference, Fly0 reduces computational overhead and improves system stability. Extensive experiments in simulation and real-world environments demonstrate that Fly0 outperforms state-of-the-art baselines, improving the Success Rate by over 20% and reducing Navigation Error (NE) by approximately 50% in unstructured environments. Our code is available at https://github.com/xuzhenxing1/Fly0.
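Stage (2) of the pipeline, lifting a grounded pixel into 3D using depth, can be illustrated with a minimal sketch of standard pinhole back-projection followed by a rigid camera-to-world transform. This is an illustration under stated assumptions, not the authors' implementation: the function names, the intrinsics (`fx`, `fy`, `cx`, `cy`), and the pose (`R_wc`, `t_wc`) are hypothetical placeholders.

```python
import numpy as np

def back_project(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth to a 3D point in the camera frame.

    Standard pinhole model; fx, fy, cx, cy are assumed camera intrinsics.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

def camera_to_world(p_cam, R_wc, t_wc):
    """Apply a rigid transform: p_world = R_wc @ p_cam + t_wc."""
    return R_wc @ p_cam + t_wc

# Hypothetical values chosen only for illustration.
p_cam = back_project(u=400, v=300, depth=5.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
p_world = camera_to_world(p_cam, R_wc=np.eye(3), t_wc=np.array([1.0, 2.0, 3.0]))
```

Once the target is expressed as a 3D point in the world frame, it can be handed to a geometric planner as a goal, so the MLLM does not need to be queried at every control step.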

Paper Structure

This paper contains 31 sections, 10 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Illustration of three navigation architectures. (a) End-to-End models directly output actions. (b) Methods using MLLM as a controller typically rely on waypoint prediction and topological maps. (c) Our proposed Fly0 framework leverages the semantic reasoning of MLLMs for precise Coordinate Acquisition and Geometric Projection, enabling the generation of smooth metric trajectories in a zero-shot manner.
  • Figure 2: Schematic diagram of the proposed visual-language navigation system. The framework leverages an MLLM to bridge the gap between semantic instructions and metric navigation. By grounding the user's command into a 2D target position, the system employs an un-projection module to derive the corresponding spatial coordinates. This precise 3D localization enables the Ego-Planner to compute optimal, obstacle-avoiding paths, as demonstrated in the field tests shown on the right.
  • Figure 3: The execution pipeline from visual perception to trajectory optimization. First, the Input Stream combines RGB-D images and instructions to locate the target in the image frame. Next, the Back-Projection Engine maps this 2D point to the 3D world frame via the pinhole camera model. Finally, based on the projected 3D target and local LiDAR sensing, the Ego-Planner optimizes the flight path by solving a gradient-based problem to ensure collision-free and kinematically feasible motion.
  • Figure 4: The full text of the prompt used to guide the MLLM. It instructs the model to locate the destination specified in the user's command within the 2D image frame and return its pixel coordinates, handling cases of both target existence and absence.
  • Figure 5: Qualitative visualization of a sequential navigation task in a complex urban environment. The UAV executes a multi-step composite instruction (bottom) sequentially. The three rows correspond to the three sub-tasks: approaching the tree, navigating to the streetlight, and reaching the final destination. The columns display the onboard First-Person View (FPV) with LiDAR perception, followed by Frontal and Lateral views of the generated collision-free trajectory (purple curve).
  • ...and 1 more figure
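The Figure 4 caption says the prompt asks the MLLM for pixel coordinates and covers both target presence and absence. A minimal sketch of how such a reply might be validated before back-projection follows. The reply format is not given in this section, so the JSON schema (`found`, `u`, `v`) and the function name are purely hypothetical assumptions.

```python
import json
from typing import Optional, Tuple

def parse_grounding_reply(reply: str, width: int, height: int) -> Optional[Tuple[int, int]]:
    """Return (u, v) if the reply reports a target inside the image, else None.

    Assumes a hypothetical JSON reply such as {"found": true, "u": 412, "v": 230}.
    """
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return None  # malformed reply: treat as target absent
    if not isinstance(data, dict) or not data.get("found"):
        return None  # the model reported that the target is not visible
    try:
        u, v = int(data["u"]), int(data["v"])
    except (KeyError, TypeError, ValueError):
        return None  # missing or non-numeric coordinates
    if not (0 <= u < width and 0 <= v < height):
        return None  # coordinates fall outside the image frame
    return u, v
```

Returning `None` for every failure case lets the caller treat "target absent" and "unusable reply" uniformly, for example by keeping the previous 3D goal instead of issuing a new one.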