OpenVLN: Open-world Aerial Vision-Language Navigation

Peican Lin, Gan Sun, Chenxi Liu, Fazeng Li, Weihong Ren, Yang Cong

TL;DR

OpenVLN tackles data scarcity and long-horizon UAV vision-language navigation by combining rule-based reinforcement-learning fine-tuning of a vision-language model with a value-model-guided long-horizon planner. It introduces dense, verifiable rewards through a value model $V_\rho$ that produces $R_t^{V}$, and it uses a PPO-like objective with KL regularization to keep updates stable under limited data. A VLN-CE replanner handles trajectory synthesis for smooth, robust long-horizon navigation. Evaluations on TravelUAV with 25% of the training data show consistent improvements in SR, OSR, and SPL, confirming enhanced robustness in open-world aerial environments. The value-model reward is $R_t^{V} = r_{level}$ if $\frac{1}{1-\mathrm{Sim}(F_{s_t}, F_{w_n})} \ge r_{level}$ and $R_t^{V} = \frac{1}{1-\mathrm{Sim}(F_{s_t}, F_{w_n})}$ otherwise, i.e., $R_t^{V} = \min\left(\frac{1}{1-\mathrm{Sim}(F_{s_t}, F_{w_n})},\, r_{level}\right)$.
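The clipped value-model reward can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes $\mathrm{Sim}$ is cosine similarity between the state feature $F_{s_t}$ and the waypoint feature $F_{w_n}$, and the function names `cosine_similarity` and `value_reward` are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two feature vectors (assumed choice of Sim)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("feature vectors must be non-zero")
    return dot / (norm_a * norm_b)


def value_reward(f_state: Sequence[float], f_waypoint: Sequence[float],
                 r_level: float) -> float:
    """Hypothetical helper: R_t^V = min(1 / (1 - Sim(F_{s_t}, F_{w_n})), r_level)."""
    sim = cosine_similarity(f_state, f_waypoint)
    # 1 / (1 - Sim) diverges as the features align, so a perfect match
    # is mapped directly to the cap r_level (avoids division by zero).
    if sim >= 1.0:
        return r_level
    raw = 1.0 / (1.0 - sim)
    return r_level if raw >= r_level else raw
```

For example, orthogonal features give a reward of 1.0, opposite features give 0.5, and identical features hit the cap `r_level`. The cap keeps the reward bounded as the UAV's view approaches the waypoint's appearance.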

Abstract

Vision-language models (VLMs) have been widely applied in ground-based vision-language navigation (VLN). However, the vast complexity of outdoor aerial environments compounds data acquisition challenges and imposes long-horizon trajectory planning requirements on Unmanned Aerial Vehicles (UAVs), introducing novel complexities for aerial VLN. To address these challenges, we propose a data-efficient Open-world aerial Vision-Language Navigation (i.e., OpenVLN) framework, which can execute language-guided flight under limited data and enhances long-horizon trajectory planning in complex aerial environments. Specifically, we reconfigure a reinforcement learning framework to optimize the VLM for UAV navigation tasks, which can efficiently fine-tune the VLM using rule-based policies under limited training data. Concurrently, we introduce a long-horizon planner for trajectory synthesis that dynamically generates precise UAV actions via value-based rewards. Finally, we conduct extensive navigation experiments on the TravelUAV benchmark with dataset scaling across diverse reward settings. Our method demonstrates consistent performance gains of up to 4.34% in Success Rate, 6.19% in Oracle Success Rate, and 4.07% in Success weighted by Path Length over baseline methods, validating its deployment efficacy for long-horizon UAV navigation in complex aerial environments.
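A KL-regularized, PPO-style objective of the kind the TL;DR mentions can be sketched as below. This is an illustration under stated assumptions rather than OpenVLN's exact loss: it uses the standard clipped surrogate, a per-sample non-negative KL estimate $r - \log r - 1$ with $r = \pi_{ref}/\pi_\theta$ (the "k3" estimator also used in GRPO), and hypothetical defaults `clip_eps=0.2` and `kl_coef=0.04`. How the rule-based and value-model rewards are turned into `advantages` is left open.

```python
import math
from typing import Sequence


def ppo_kl_objective(
    logp_new: Sequence[float],
    logp_old: Sequence[float],
    logp_ref: Sequence[float],
    advantages: Sequence[float],
    clip_eps: float = 0.2,   # hypothetical clipping range
    kl_coef: float = 0.04,   # hypothetical KL penalty weight
) -> float:
    """Mean clipped PPO surrogate minus a KL penalty toward a reference policy.

    Returns an objective to MAXIMIZE; negate it to use as a loss.
    """
    n = len(advantages)
    if n == 0 or not (len(logp_new) == len(logp_old) == len(logp_ref) == n):
        raise ValueError("inputs must be non-empty and equally long")
    total = 0.0
    for lp_new, lp_old, lp_ref, adv in zip(logp_new, logp_old, logp_ref, advantages):
        ratio = math.exp(lp_new - lp_old)  # pi_theta / pi_old
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Pessimistic (min) choice between unclipped and clipped surrogate.
        surrogate = min(ratio * adv, clipped * adv)
        # Non-negative KL estimate: r - log r - 1 with r = pi_ref / pi_theta.
        log_r = lp_ref - lp_new
        kl = math.exp(log_r) - log_r - 1.0
        total += surrogate - kl_coef * kl
    return total / n
```

In a real VLM fine-tuning loop these log-probabilities would be autograd tensors over generated tokens rather than Python floats; the scalar version only makes the clipping and KL terms explicit.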

Paper Structure

This paper contains 14 sections, 4 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Challenge illustration for the long-distance Vision-Language Navigation task, where the drone needs to navigate to the destination according to the given instruction. Extended planning trajectories propagate environmental uncertainty, compounding the accumulation of localization error.
  • Figure 2: Overall architecture of our OpenVLN framework. (a) The VLN-CE replanner that controls flight actions; (b) and (c) together form a reinforcement learning framework for data-efficient VLM fine-tuning under data scarcity, comprising a verifiable reward, a value-model reward at waypoints, and an RL fine-tuning optimizer; and (d) a vision-language navigation model with a sensing encoder, a multi-modal grounding module [liuGroundingDINOMarrying2025], and an action decoder that predicts the next waypoint, producing the planned trajectory and the UAV pose.
  • Figure 3: Comparison between the baseline and our proposed method. Rows 1–2: our UAV navigator reliably reaches the destination by detecting the instructed objectives one by one. Row 3: with the baseline method, the drone collides with a building, and the mission fails because of overlapping occlusion.
  • Figure 4: Evaluation results under three levels of assistance. (a)-(c) Radar diagrams of Success Rate (SR), Oracle Success Rate (OSR), and Success weighted by Path Length (SPL) with Level 1-3 assistance, respectively (higher is better). (d) Average Normalized Error on the TravelUAV benchmark (lower is better).
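The SR, OSR, and SPL metrics reported in Figure 4 follow common VLN definitions, sketched below for reference. This is not the TravelUAV evaluation code: the `Episode` container, the `success_radius` value, and how the reference path length $l_i$ is obtained are assumptions. SPL takes the usual form $\frac{1}{N}\sum_i S_i \frac{l_i}{\max(p_i, l_i)}$, where $S_i$ is binary success and $p_i$ is the flown path length. The Average Normalized Error in panel (d) is not reproduced because its exact normalization is not given here.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


@dataclass
class Episode:  # hypothetical container for one evaluation episode
    trajectory: List[Point]       # UAV positions visited, in order
    goal: Point                   # target position
    shortest_path_length: float   # reference path length l_i (assumed > 0)


def path_length(traj: Sequence[Point]) -> float:
    """Total Euclidean length of a polyline trajectory."""
    return sum(math.dist(a, b) for a, b in zip(traj, traj[1:]))


def evaluate(episodes: Sequence[Episode],
             success_radius: float) -> Tuple[float, float, float]:
    """Return (SR, OSR, SPL) averaged over episodes."""
    if not episodes:
        raise ValueError("need at least one episode")
    sr = osr = spl = 0.0
    for ep in episodes:
        # Success: the final position lies within the radius of the goal.
        success = math.dist(ep.trajectory[-1], ep.goal) <= success_radius
        # Oracle success: any visited position lies within the radius.
        oracle = any(math.dist(p, ep.goal) <= success_radius for p in ep.trajectory)
        sr += float(success)
        osr += float(oracle)
        if success:
            p = path_length(ep.trajectory)
            l = ep.shortest_path_length
            spl += l / max(p, l)
    n = len(episodes)
    return sr / n, osr / n, spl / n
```

OSR is never lower than SR, since it credits episodes that pass the goal but stop elsewhere, and SPL discounts successful episodes whose flown path is longer than the reference path.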