Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning

Huilin Xu, Zhuoyang Liu, Yixiang Luomei, Feng Xu

TL;DR

This work tackles aerial vision-and-language navigation (VLN) using only monocular RGB observations, removing reliance on panoramic views, depth sensors, or odometry. It formulates navigation as a unified next-token prediction problem, guided by task-specific prompts that instantiate spatial perception, trajectory reasoning, and embodied navigation within a single multimodal backbone. Key contributions include a data preprocessing pipeline (action merging and keyframe selection), a prompt-driven multitask learning framework, and an action parsing/execution scheme, all trained on a curated multi-task dataset. Empirical results on the AerialVLN-S benchmark show state-of-the-art performance among RGB-only approaches and a reduced gap to panoramic RGB-D methods, with ablations confirming the efficacy of auxiliary tasks, history representations, and preprocessing strategies. The approach promises practical applicability for lightweight UAVs by delivering robust long-horizon instruction following using only onboard monocular vision.

Abstract

Aerial Vision-and-Language Navigation (VLN) aims to enable unmanned aerial vehicles (UAVs) to interpret natural language instructions and navigate complex urban environments using onboard visual observation. This task holds promise for real-world applications such as low-altitude inspection, search-and-rescue, and autonomous aerial delivery. Existing methods often rely on panoramic images, depth inputs, or odometry to support spatial reasoning and action planning. These requirements increase system cost and integration complexity, thus hindering practical deployment for lightweight UAVs. We present a unified aerial VLN framework that operates solely on egocentric monocular RGB observations and natural language instructions. The model formulates navigation as a next-token prediction problem, jointly optimizing spatial perception, trajectory reasoning, and action prediction through prompt-guided multi-task learning. Moreover, we propose a keyframe selection strategy to reduce visual redundancy by retaining semantically informative frames, along with an action merging and label reweighting mechanism that mitigates long-tailed supervision imbalance and facilitates stable multi-task co-training. Extensive experiments on the Aerial VLN benchmark validate the effectiveness of our method. Under the challenging monocular RGB-only setting, our model achieves strong results across both seen and unseen environments. It significantly outperforms existing RGB-only baselines and narrows the performance gap with state-of-the-art panoramic RGB-D counterparts. Comprehensive ablation studies further demonstrate the contribution of our task design and architectural choices.
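The abstract names action merging and label reweighting without giving their details here. As a rough illustration only, the sketch below assumes that merging means collapsing runs of identical primitive actions into a single (action, count) label, and that reweighting uses inverse class frequency. Both are assumptions, not the authors' confirmed procedure. The function names and the abbreviations TL and MU are hypothetical; only MF (move forward) appears in the source.

```python
from collections import Counter
from itertools import groupby


def merge_actions(actions):
    """Collapse each run of identical primitives into an (action, count) pair.

    Assumption: this run-length scheme stands in for the paper's action merging.
    """
    return [(name, sum(1 for _ in run)) for name, run in groupby(actions)]


def inverse_frequency_weights(labels):
    """Return a weight per label of total / (num_classes * count).

    Assumption: a common reweighting heuristic, not necessarily the paper's.
    Rare labels receive weights above 1, frequent labels below 1.
    """
    counts = Counter(labels)
    total = len(labels)
    return {label: total / (len(counts) * n) for label, n in counts.items()}


# MF = move forward (from the source); TL and MU are hypothetical abbreviations.
primitives = ["MF", "MF", "MF", "TL", "MF", "MF", "MU"]
merged = merge_actions(primitives)
weights = inverse_frequency_weights([name for name, _ in merged])
```

Under these assumptions, a long run of forward steps becomes one merged label, which shortens the sequence and enlarges the label vocabulary, in line with the effect the Figure 2 caption describes.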

Paper Structure

This paper contains 21 sections, 6 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Aerial vision-language navigation. Left: A drone receives a natural-language instruction along with egocentric visual observations and is required to navigate to the destination in a complex outdoor environment. Right: This task relies on the agent’s ability to maintain an accurate understanding of its navigational situation, including estimating its current position, interpreting its progress within the instruction, and determining the next movement consistent with the described route. The example highlights these dimensions of temporal and spatial reasoning, which are central to reliable long-horizon aerial navigation.
  • Figure 2: Trajectory statistics before and after data preprocessing, showing that action merging and keyframe selection yield a richer action space and more compact navigation sequences. For brevity, action names are abbreviated (e.g., MF for move forward).
  • Figure 3: Overview of our framework. Given egocentric keyframes selected from the onboard video stream, our model first encodes the visual observations through a vision encoder and an MLP projector to obtain visual tokens, while language instructions are processed by a text tokenizer. The unified multimodal tokens are then fed into a large language model that is jointly trained on three complementary tasks: (i) Spatial Perception, which queries the current scene; (ii) Trajectory Reasoning, which summarizes historical motion and infers the agent’s navigational context; and (iii) Embodied Navigation, which predicts high-level action commands. The predicted textual action is then parsed and decomposed into a sequence of predefined motion primitives for execution in the physical environment.
  • Figure 4: Unified prompting interface for the proposed model. Through task-specific prompts, the model supports aerial navigation as the primary task, while also handling spatial perception and trajectory reasoning as auxiliary capabilities that enrich spatial understanding and temporal grounding.
  • Figure 5: Qualitative comparison between predicted and ground-truth drone trajectories in 3D space. Across multiple validation episodes, the predicted paths follow the global route structure, capturing the major turns, long-range transitions, and altitude changes required by the language instruction. These results highlight strong 3D spatial reasoning and stable flight control from egocentric observations.
  • ...and 4 more figures
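
The final step in the Figure 3 caption, parsing a predicted textual action and decomposing it into predefined motion primitives, could be sketched as follows. This is a minimal illustration under stated assumptions: the action phrasing ("move forward 15 meters"), the step sizes in STEP, and the function name parse_action are all hypothetical and are not given in the text above.

```python
import re

# Hypothetical step size per primitive (meters for translation, degrees for turns).
STEP = {"move forward": 5.0, "turn left": 15.0, "turn right": 15.0}

# Assumed output format: "<action> <amount> [meters|degrees]".
PATTERN = re.compile(
    r"^(move forward|turn left|turn right)\s+(\d+(?:\.\d+)?)\s*(?:meters|degrees)?$"
)


def parse_action(text):
    """Turn one textual action into a list of repeated primitive actions."""
    text = text.strip().lower()
    if text == "stop":
        return ["stop"]
    match = PATTERN.match(text)
    if match is None:
        raise ValueError(f"unrecognized action: {text!r}")
    name, amount = match.group(1), float(match.group(2))
    # Round to the nearest whole number of primitive steps, executing at least one.
    repeats = max(1, round(amount / STEP[name]))
    return [name] * repeats
```

With these assumed step sizes, parse_action("move forward 15 meters") yields three forward primitives. Raising on unrecognized text, rather than guessing, is one possible design choice for handling malformed model output.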