SpatialFly: Geometry-Guided Representation Alignment for UAV Vision-and-Language Navigation in Urban Environments

Wen Jiang, Kangyao Huang, Li Wang, Wang Xu, Wei Fan, Jinyuan Liu, Shaoyu Liu, Hanfang Liang, Hongwei Duan, Bin Xu, Xiangyang Ji

Abstract

Unmanned aerial vehicles (UAVs) play an important role in applications such as autonomous exploration, disaster response, and infrastructure inspection. However, UAV vision-and-language navigation (VLN) in complex 3D environments remains challenging. A key difficulty is the structural representation mismatch between 2D visual perception and the 3D trajectory decision space, which limits spatial reasoning. To this end, we propose SpatialFly, a geometry-guided spatial representation framework for UAV VLN. Operating on RGB observations without explicit 3D reconstruction, SpatialFly introduces a geometry-guided 2D representation alignment mechanism. Specifically, the geometric prior injection (GPI) module injects global structural cues into 2D semantic tokens to provide scene-level geometric guidance. The geometry-aware reparameterization (GAR) module then aligns 2D semantic tokens with 3D geometric tokens through cross-modal attention, followed by gated residual fusion to preserve semantic discrimination. Experimental results show that SpatialFly consistently outperforms state-of-the-art UAV VLN baselines across both seen and unseen environments, reducing navigation error (NE) by 4.03 m and improving success rate (SR) by 1.27% over the strongest baseline on the unseen Full split. Additional trajectory-level analysis shows that SpatialFly produces trajectories with better path alignment and smoother, more stable motion.

Paper Structure (29 sections, 16 equations, 10 figures, 7 tables, 1 algorithm)

This paper contains 29 sections, 16 equations, 10 figures, 7 tables, 1 algorithm.

Figures (10)

  • Figure 1: Motivation and overview of SpatialFly. UAV VLN suffers from a structural representation mismatch between multi-view 2D visual perception and continuous 3D trajectory decision making. Without explicit geometric cues, multi-view RGB observations often lead to inconsistent spatial understanding and unstable 3D path prediction. SpatialFly addresses this gap through a geometry-guided representation alignment mechanism, improving cross-view consistency and yielding more reliable 3D trajectory prediction.
  • Figure 2: Overall architecture of SpatialFly. Given multi-view RGB observations, the language instruction, and the UAV state, SpatialFly extracts 2D semantic tokens and implicit 3D geometric tokens. The geometric prior injection (GPI) module injects global structural cues into the 2D semantic tokens for scene-level geometric guidance. The geometry-aware reparameterization (GAR) module then aligns the 2D semantic tokens with the 3D geometric tokens through cross-modal attention and gated fusion. Finally, the aligned visual representations are integrated with language and state tokens for downstream action prediction.
  • Figure 3: Illustration of the Geometric Prior Injection (GPI) module. Implicit 3D geometric tokens are first summarized by mean pooling to obtain a global geometric representation, which is then passed through a modulation MLP to generate the FiLM parameters $\gamma$ and $\beta$. These modulation terms are applied to the 2D base tokens in a FiLM-like manner, and the resulting features are further combined with the original tokens through a learnable injection strength $\eta$ to produce geometry-injected representations.
  • Figure 4: Quantitative comparison of different methods on the seen setting. The left y-axis denotes navigation error (NE), and the right y-axis denotes success rate (SR), oracle success rate (OSR), and success weighted by path length (SPL).
  • Figure 5: Quantitative comparison on unseen overall and unseen map settings. Blue and orange bars denote the results on unseen overall and unseen map, respectively. Left and right y-axes correspond to NE and SR/OSR/SPL, respectively.
  • ...and 5 more figures
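
The captions of Figures 2 and 3 describe the GPI and GAR modules only in words, so a small sketch may help make the data flow concrete. The NumPy code below is a minimal illustration under stated assumptions, not the authors' implementation: the function names `gpi` and `gar`, all weight matrices, the two-layer ReLU modulation MLP, the single attention head, the exact way $\eta$ combines modulated and original tokens, and the form of the gate are all hypothetical choices made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gpi(sem, geo, w1, b1, w2, b2, eta):
    """FiLM-like geometric prior injection (assumed form).

    sem: (N, D) 2D semantic tokens; geo: (M, Dg) implicit 3D geometric tokens.
    eta: injection strength (learnable in the paper, a plain float here).
    """
    g = geo.mean(axis=0)                  # (Dg,) mean-pooled global geometry
    h = np.maximum(g @ w1 + b1, 0.0)      # hidden layer of the modulation MLP
    gamma, beta = np.split(h @ w2 + b2, 2)  # FiLM parameters, each (D,)
    # Assumption: modulated features are added back onto the original tokens,
    # scaled by eta, so eta = 0 leaves the tokens unchanged.
    return sem + eta * (gamma * sem + beta)

def gar(sem, geo, wq, wk, wv, wg):
    """Cross-modal attention followed by gated residual fusion (assumed form)."""
    q = sem @ wq                          # (N, D) queries from 2D tokens
    k = geo @ wk                          # (M, D) keys from 3D tokens
    v = geo @ wv                          # (M, D) values from 3D tokens
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (N, M), rows sum to 1
    aligned = attn @ v                    # (N, D) geometry gathered per 2D token
    # Assumption: an elementwise gate computed from both streams decides how
    # much geometric content is added to each semantic token.
    gate = sigmoid(np.concatenate([sem, aligned], axis=-1) @ wg)  # (N, D)
    return sem + gate * aligned

# Toy shapes and random weights, purely for illustration.
rng = np.random.default_rng(0)
N, M, D, Dg, H = 6, 4, 8, 5, 16
sem = rng.normal(size=(N, D))
geo = rng.normal(size=(M, Dg))
w1, b1 = rng.normal(size=(Dg, H)), np.zeros(H)
w2, b2 = rng.normal(size=(H, 2 * D)), np.zeros(2 * D)
wq = rng.normal(size=(D, D))
wk, wv = rng.normal(size=(Dg, D)), rng.normal(size=(Dg, D))
wg = rng.normal(size=(2 * D, D))

injected = gpi(sem, geo, w1, b1, w2, b2, eta=0.1)
fused = gar(injected, geo, wq, wk, wv, wg)
print(injected.shape, fused.shape)
```

In this sketch both modules are residual: the 2D tokens pass through unchanged when the injection strength or the gate is near zero, which is one plausible reading of the stated goal of preserving semantic discrimination.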