KiteRunner: Language-Driven Cooperative Local-Global Navigation Policy with UAV Mapping in Outdoor Environments

Shibo Huang, Chenfan Shi, Jian Yang, Hanlin Dong, Jinpeng Mi, Ke Li, Jianfeng Zhang, Miao Ding, Peidong Liang, Xiong You, Xian Wei

TL;DR

KiteRunner tackles open-world outdoor navigation under dynamic conditions by jointly addressing semantic grounding and long-range spatial reasoning. It combines a Vision-Language Processor, a diffusion-model-based Local Planner, and a UAV orthophoto-based Global Planner to ground natural-language instructions into feasible, long-distance trajectories. The Local Planner generates 8 candidate paths via a DDPM conditioned on current perception. The Global Planner builds a traversability probability map from Digital Orthophoto Maps, guides path choice with a semantic- and geometry-aware score $\mathrm{Score}(a^{(k)})=\sum_{i=1}^{N} P_m(x_i,y_i) \cdot P_w(x_i,y_i)$, and uses the training loss $\mathcal{L} = -\lambda \cdot (1-p_i)^{\gamma} \cdot \log(p_i)$. The Vision-Language Processor grounds commands through GPT-4o-derived landmarks and CLIP-based visual-semantic matching, yielding $Q(v)=\sum_{j=1}^n S_{v,j}-\beta\,D(v_{start},v)$ for path optimization. Experiments show path-efficiency improvements over state-of-the-art methods of 5.6% in structured and 12.8% in unstructured environments, along with reduced human interventions and execution time, supporting the approach's robustness and real-time adaptability. These findings highlight the practical potential of language-grounded, globally informed navigation for outdoor robotics.
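To make the score and loss above concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: $P_m$ is read as the traversability probability map and $P_w$ as a per-cell weight map (this summary does not define either), paths are assumed to be lists of grid cells, and the grid values, candidate paths, and the defaults for $\lambda$ and $\gamma$ are hypothetical. The loss has the form of a focal loss.

```python
import math
from typing import List, Sequence, Tuple

Grid = List[List[float]]
Path = Sequence[Tuple[int, int]]  # waypoints as (x, y) grid cells (assumed representation)


def path_score(path: Path, p_map: Grid, p_weight: Grid) -> float:
    """Score(a) = sum_i P_m(x_i, y_i) * P_w(x_i, y_i) over the path's waypoints."""
    return sum(p_map[y][x] * p_weight[y][x] for x, y in path)


def select_path(candidates: Sequence[Path], p_map: Grid, p_weight: Grid):
    """Return the index of the highest-scoring candidate and all scores."""
    scores = [path_score(p, p_map, p_weight) for p in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


def focal_loss(p: float, lam: float = 1.0, gamma: float = 2.0) -> float:
    """L = -lambda * (1 - p)^gamma * log(p), with p the probability of the true class."""
    p = max(p, 1e-12)  # guard against log(0); the clamp value is an assumption
    return -lam * (1.0 - p) ** gamma * math.log(p)


# Hypothetical 3x3 maps: the top row and right column are highly traversable.
p_map = [
    [0.9, 0.9, 0.9],
    [0.1, 0.2, 0.9],
    [0.1, 0.1, 0.9],
]
p_weight = [[1.0] * 3 for _ in range(3)]  # uniform weights (assumption)

candidates = [
    [(0, 0), (1, 0), (2, 0)],  # moves along the traversable top row
    [(0, 0), (0, 1), (0, 2)],  # moves down the poorly traversable left column
]

best, scores = select_path(candidates, p_map, p_weight)
print(best, scores)
```

In this toy setup the first candidate wins because every waypoint lies on high-probability cells. In the described system the Local Planner would supply 8 such candidates per step.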

Abstract

Autonomous navigation in open-world outdoor environments faces challenges in integrating dynamic conditions, long-distance spatial reasoning, and semantic understanding. Traditional methods struggle to balance local planning, global planning, and semantic task execution, while existing large language models (LLMs) enhance semantic comprehension but lack spatial reasoning capabilities. Although diffusion models excel in local optimization, they fall short in large-scale long-distance navigation. To address these gaps, this paper proposes KiteRunner, a language-driven cooperative local-global navigation strategy that combines UAV orthophoto-based global planning with diffusion model-driven local path generation for long-distance navigation in open-world scenarios. Our method innovatively leverages real-time UAV orthophotography to construct a global probability map, providing traversability guidance for the local planner, while integrating large models like CLIP and GPT to interpret natural language instructions. Experiments demonstrate that KiteRunner achieves 5.6% and 12.8% improvements in path efficiency over state-of-the-art methods in structured and unstructured environments, respectively, with significant reductions in human interventions and execution time.

Paper Structure

This paper contains 21 sections, 9 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: KiteRunner accepts natural language instructions and achieves long-distance outdoor navigation through the cooperation of the Local Planner and the Global Planner: (a) natural language commands, (b) the Local Planner generating context-aware candidate paths, and (c) the Global Planner providing traversability guidance.
  • Figure 2: KiteRunner is suitable for outdoor navigation tasks in structured and unstructured scenes. The proposed method employs the Vision-Language Processor (VLP) to infer optimal navigation paths adhering to natural language instructions. By strategically integrating trajectory outputs from the Local Planner with traversability probability maps generated by the Global Planner, this framework achieves efficient language-guided outdoor navigation.
  • Figure 3: Visualization of local path prediction. The Local Planner (LP) generates multiple local paths based on the current observation and subgoal image, from which the Global Planner (GP) selects the green path.
  • Figure 4: The global UAV map (b) provides more up-to-date land cover information than the satellite map (a) in the study area.
  • Figure 5: Language-driven instructions guide navigation tasks in structured (left) and unstructured (right) environments. Key landmarks marked with stars include the red road post, solar panel, gate, and other reference points.
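
The landmark-based grounding illustrated in Figures 2 and 5 relies on the quality term $Q(v)=\sum_{j=1}^n S_{v,j}-\beta\,D(v_{start},v)$ from the TL;DR. The sketch below shows how such a term could rank candidate nodes. It assumes that $S_{v,j}$ is a CLIP-style similarity between node $v$ and landmark $j$ and that $D$ is the distance from the start node; this summary defines neither, and the node names, similarity values, distances, and $\beta$ are hypothetical.

```python
from typing import Dict, List, Tuple

# node name -> (similarities to each instruction landmark, distance from start)
NodeTable = Dict[str, Tuple[List[float], float]]


def node_quality(similarities: List[float], distance_from_start: float, beta: float) -> float:
    """Q(v) = sum_j S_{v,j} - beta * D(v_start, v)."""
    return sum(similarities) - beta * distance_from_start


def best_node(nodes: NodeTable, beta: float) -> str:
    """Return the name of the node with the highest Q(v)."""
    return max(nodes, key=lambda name: node_quality(*nodes[name], beta=beta))


# Hypothetical candidates: similarity to two landmarks, and distance in metres.
nodes: NodeTable = {
    "gate": ([0.8, 0.3], 10.0),
    "solar_panel": ([0.6, 0.7], 40.0),
}

print(best_node(nodes, beta=0.01))  # distance penalty active
print(best_node(nodes, beta=0.0))   # similarity only
```

With the distance penalty active, the nearer node is preferred even though its summed similarity is lower; setting $\beta = 0$ reduces the ranking to similarity alone, which shows the trade-off that $\beta$ controls.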