AerialVLA: A Vision-Language-Action Model for UAV Navigation via Minimalist End-to-End Control

Peng Xu, Zhengnan Deng, Jiayan Deng, Zonghua Gu, Shaohua Wan

Abstract

Vision-Language Navigation (VLN) for Unmanned Aerial Vehicles (UAVs) demands complex visual interpretation and continuous control in dynamic 3D environments. Existing hierarchical approaches rely on dense oracle guidance or auxiliary object detectors, creating semantic gaps and limiting genuine autonomy. We propose AerialVLA, a minimalist end-to-end Vision-Language-Action framework that maps raw visual observations and fuzzy linguistic instructions directly to continuous physical control signals. First, we introduce a streamlined dual-view perception strategy that reduces visual redundancy while preserving essential cues for forward navigation and precise grounding, which additionally facilitates future simulation-to-reality transfer. Second, to reclaim genuine autonomy, we deploy a fuzzy directional prompting mechanism derived solely from onboard sensors, eliminating the dependency on dense oracle guidance. Finally, we formulate a unified control space that integrates continuous 3-Degree-of-Freedom (3-DoF) kinematic commands with an intrinsic landing signal, freeing the agent from external object detectors for precision landing. Extensive experiments on the TravelUAV benchmark demonstrate that AerialVLA achieves state-of-the-art performance in seen environments. Furthermore, it generalizes better to unseen scenarios, achieving nearly three times the success rate of leading baselines, which suggests that a minimalist, autonomy-centric paradigm captures more robust visual-motor representations than complex modular systems.

Paper Structure (20 sections, 2 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Comparison of UAV navigation paradigms. While existing modular methods (red) rely on oracle guidance and external detectors, AerialVLA (green) achieves agile autonomous navigation and precise landing via a unified end-to-end policy driven by fuzzy onboard hints and intrinsic stopping.
  • Figure 2: The architecture of AerialVLA. The framework processes multimodal inputs to generate continuous control signals end-to-end. (a) Language Input constructs prompts with fuzzy directional hints derived from the IMU, eliminating oracle reliance. (b) Visual Input fuses front and down views via a vertical mosaic. (c) The AerialVLA Model utilizes a Llama-2 backbone with LoRA to autoregressively predict numerical tokens. (d) Output and Action Stage decodes tokens into spatial offsets for velocity control, or triggers the dual-condition landing.
  • Figure 3: AerialVLA prompt formulation. The structured prompt comprises four components: (i) an <image> token for visual input, (ii) a fuzzy directional hint (red), (iii) a detailed target description, and (iv) the corresponding numerical control actions (blue).
  • Figure 4: Qualitative visualization of our proposed AerialVLA. We display the vertical mosaic inputs (Front/Down) at key timesteps. The agent demonstrates precision maneuvering in clutter (Top) and active error correction against distractors (Bottom), validating the robustness of the end-to-end policy.
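
The captions for Figures 2 and 3 describe three input-side steps: stacking the front and down views into a vertical mosaic, turning an onboard direction estimate into a fuzzy hint, and assembling a prompt that starts with an <image> token. The Python sketch below illustrates those steps under stated assumptions and is not the authors' implementation. The function names, the bearing bin edges, the hint wording, and the prompt template are all hypothetical. It also assumes a relative bearing in degrees is already available from onboard sensing; how AerialVLA derives its hint from the IMU is not specified in this summary.

```python
def vertical_mosaic(front, down):
    """Stack the front view above the down view.

    Images are plain lists of pixel rows; both views must share a width.
    (Assumption: the mosaic is a simple top/bottom concatenation.)
    """
    if not front or not down:
        raise ValueError("both views must contain at least one row")
    if len(front[0]) != len(down[0]):
        raise ValueError("front and down views must have the same width")
    return front + down


def fuzzy_direction_hint(relative_bearing_deg):
    """Map a relative bearing to a coarse phrase.

    0 means straight ahead, positive means clockwise (to the right).
    The bin edges and wording are illustrative assumptions.
    """
    # Wrap the angle into [-180, 180) so the bins are symmetric.
    b = (relative_bearing_deg + 180.0) % 360.0 - 180.0
    if abs(b) <= 22.5:
        return "straight ahead"
    if abs(b) >= 157.5:
        return "behind you"
    side = "right" if b > 0 else "left"
    if abs(b) < 67.5:
        return f"ahead to the {side}"
    if abs(b) <= 112.5:
        return f"to the {side}"
    return f"behind to the {side}"


def build_prompt(hint, target_description):
    """Assemble image token, fuzzy hint, and target description.

    The template is hypothetical; during training the numerical action
    tokens would follow the prompt as the supervised output.
    """
    return f"<image>\nThe target is {hint}. {target_description}\nAction:"


if __name__ == "__main__":
    mosaic = vertical_mosaic([[1, 2]], [[3, 4], [5, 6]])
    print(len(mosaic), "rows in mosaic")
    print(build_prompt(fuzzy_direction_hint(30.0), "It is a red car near a fountain."))
```

Binning the bearing keeps the hint deliberately coarse, which matches the stated goal of replacing dense oracle guidance with an approximate cue; a finer or coarser binning would be an equally valid choice for this sketch.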