KiteRunner: Language-Driven Cooperative Local-Global Navigation Policy with UAV Mapping in Outdoor Environments
Shibo Huang, Chenfan Shi, Jian Yang, Hanlin Dong, Jinpeng Mi, Ke Li, Jianfeng Zhang, Miao Ding, Peidong Liang, Xiong You, Xian Wei
TL;DR
KiteRunner targets open-world outdoor navigation under dynamic conditions, where a robot needs both semantic grounding and long-range spatial reasoning. It combines three components: a Vision-Language Processor, a diffusion-based Local Planner, and a Global Planner built on UAV orthophotos, which together ground natural-language instructions into feasible long-distance trajectories. The Local Planner uses a DDPM conditioned on current perception to generate 8 candidate paths. The Global Planner builds a traversability probability map from Digital Orthophoto Maps and guides the choice among candidate paths $a^{(k)}$ with a semantic- and geometry-aware score over the $N$ points $(x_i,y_i)$ of each path, $\mathrm{Score}(a^{(k)})=\sum_{i=1}^{N} P_m(x_i,y_i) \cdot P_w(x_i,y_i)$, together with the training loss $\mathcal{L} = -\lambda \cdot (1-p_i)^{\gamma} \cdot \log(p_i)$. The Vision-Language Processor grounds commands by deriving landmarks with GPT-4o and matching them to visual observations with CLIP, producing the score $Q(v)=\sum_{j=1}^n S_{v,j}-\beta\,D(v_{start},v)$ used for path optimization. In experiments, KiteRunner improves path efficiency over state-of-the-art methods by 5.6% in structured environments and 12.8% in unstructured environments, and it reduces human interventions and execution time. These results support the approach’s robustness and real-time adaptability, and they point to the practical potential of language-grounded, globally informed navigation for outdoor robotics.
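The path score and the training loss above can be illustrated with a small sketch. It is not the authors' implementation: it assumes that $P_m$ and $P_w$ are two per-cell maps stored as nested lists indexed `[y][x]`, that each candidate path is a list of integer grid cells, and that the candidate with the highest score is selected. The function names, the toy maps, and the default values of `lam` and `gamma` are all hypothetical.

```python
import math

def path_score(path, p_m, p_w):
    """Sum P_m(x_i, y_i) * P_w(x_i, y_i) over the points of one candidate path.

    Assumption: p_m and p_w are 2D grids indexed as grid[y][x].
    """
    return sum(p_m[y][x] * p_w[y][x] for x, y in path)

def select_path(candidates, p_m, p_w):
    """Return (index of the highest-scoring candidate, list of all scores).

    Assumption: the best path is the one with the largest score.
    """
    scores = [path_score(path, p_m, p_w) for path in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return best, scores

def loss_term(p_i, lam=1.0, gamma=2.0):
    """Per-sample loss -lam * (1 - p_i)^gamma * log(p_i), for 0 < p_i <= 1.

    The default lam and gamma are placeholders, not values from the paper.
    """
    return -lam * (1.0 - p_i) ** gamma * math.log(p_i)

# Toy 3x3 maps: the left column is highly traversable, the centre is not.
p_m = [[0.9, 0.9, 0.1],
       [0.9, 0.2, 0.1],
       [0.9, 0.9, 0.9]]
p_w = [[1.0] * 3 for _ in range(3)]  # uniform second map for simplicity

candidates = [
    [(0, 0), (0, 1), (0, 2)],  # straight down the left column
    [(0, 0), (1, 1), (2, 2)],  # diagonal through the poor centre cell
]
best, scores = select_path(candidates, p_m, p_w)
print(best, scores)
print(loss_term(0.5), loss_term(0.9))
```

In this toy case the left-column path wins because the diagonal crosses a low-probability cell. The loss term shrinks quickly as $p_i$ approaches 1, so confidently correct samples contribute little to the total.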
Abstract
Autonomous navigation in open-world outdoor environments must cope with dynamic conditions, long-distance spatial reasoning, and semantic understanding at the same time. Traditional methods struggle to balance local planning, global planning, and semantic task execution, while existing large language models (LLMs) improve semantic comprehension but lack spatial reasoning capabilities. Although diffusion models excel at local optimization, they fall short in large-scale, long-distance navigation. To address these gaps, this paper proposes KiteRunner, a language-driven cooperative local-global navigation strategy that combines UAV orthophoto-based global planning with diffusion model-driven local path generation for long-distance navigation in open-world scenarios. Our method uses real-time UAV orthophotography to construct a global probability map that provides traversability guidance for the local planner, and it integrates large models such as CLIP and GPT to interpret natural language instructions. Experiments demonstrate that KiteRunner achieves 5.6% and 12.8% improvements in path efficiency over state-of-the-art methods in structured and unstructured environments, respectively, with significant reductions in human interventions and execution time.
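As a rough illustration of how language grounding can feed into path optimization, the sketch below evaluates the score $Q(v)=\sum_{j=1}^n S_{v,j}-\beta\,D(v_{start},v)$ quoted in the TL;DR. It rests on assumptions not stated in this summary: that $S_{v,j}$ is a similarity between candidate $v$ and the $j$-th landmark, that $D(v_{start},v)$ is a distance from the start, and that the candidate with the largest $Q$ is preferred. The similarity values, distances, candidate names, and $\beta$ are made up; no CLIP or GPT call is made.

```python
def landmark_quality(similarities, distance, beta=0.1):
    """Q(v) = sum_j S_{v,j} - beta * D(v_start, v).

    similarities: hypothetical similarity values S_{v,j} for one candidate v.
    distance:     hypothetical distance D(v_start, v) from the start.
    beta:         trade-off weight; 0.1 is a placeholder, not a paper value.
    """
    return sum(similarities) - beta * distance

def best_candidate(candidates, beta=0.1):
    """Return the candidate name with the largest Q(v) (assumed selection rule)."""
    return max(
        candidates,
        key=lambda name: landmark_quality(*candidates[name], beta=beta),
    )

# Each entry: (similarities to two landmarks, distance from the start).
candidates = {
    "gate": ([0.31, 0.22], 4.0),
    "fountain": ([0.35, 0.25], 12.0),
}

print(best_candidate(candidates, beta=0.1))  # distance penalty active
print(best_candidate(candidates, beta=0.0))  # similarity only
```

With the distance penalty active, the nearer candidate is preferred even though its similarity is slightly lower. Setting `beta` to zero ranks by similarity alone, which shows what the second term contributes.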
