CityNav: A Large-Scale Dataset for Real-World Aerial Navigation

Jungdae Lee; Taiki Miyanishi; Shuhei Kurita; Koya Sakamoto; Daichi Azuma; Yutaka Matsuo; Nakamasa Inoue

TL;DR

CityNav is the first large-scale real-world dataset for aerial vision-and-language navigation (VLN), with 32,637 human demonstration trajectories across Cambridge and Birmingham, built on the CityFlight 3D environment. It introduces the geographic semantic map (GSM), which fuses OpenStreetMap-derived geography with visual cues, and shows that the GSM improves three baseline aerial VLN models. Human demonstrations still outperform these models, and the work analyzes description length, landmark density, and robustness to guide future development. The dataset and methodology establish a foundation for robust, landmark-aware aerial navigation in real-world urban environments.

Abstract

Vision-and-language navigation (VLN) aims to develop agents capable of navigating in realistic environments. While recent cross-modal training approaches have significantly improved navigation performance in both indoor and outdoor scenarios, aerial navigation over real-world cities remains underexplored, primarily due to limited datasets and the difficulty of integrating visual and geographic information. To fill this gap, we introduce CityNav, the first large-scale real-world dataset for aerial VLN. Our dataset consists of 32,637 human demonstration trajectories, each paired with a natural language description, covering 4.65 km$^2$ across two real cities: Cambridge and Birmingham. In contrast to existing datasets composed of synthetic scenes, such as AerialVLN, our dataset presents a unique challenge because agents must interpret spatial relationships between real-world landmarks and the navigation destination, making CityNav an essential benchmark for advancing aerial VLN. Furthermore, as an initial step toward addressing this challenge, we provide a method for creating geographic semantic maps that can be used as an auxiliary input modality during navigation. In our experiments, we compare the performance of three representative aerial VLN agents (the Seq2seq, CMA, and AerialVLN models) and demonstrate that the semantic map representation significantly improves their navigation performance.
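To make the idea of a geographic semantic map concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes landmarks are available as axis-aligned boxes in local metric coordinates (a hypothetical simplification of OpenStreetMap footprints) and rasterizes them into a single binary grid channel. The function name `rasterize_landmarks`, the box format, and the grid parameters are all assumptions introduced here for illustration.

```python
from typing import List, Tuple

# Hypothetical landmark format: (x_min, y_min, x_max, y_max) in metres.
Box = Tuple[float, float, float, float]


def rasterize_landmarks(boxes: List[Box], extent: float, cells: int) -> List[List[int]]:
    """Mark every grid cell whose centre lies inside any landmark box.

    extent: side length of the square map area in metres (assumed).
    cells:  number of grid cells per side (assumed).
    """
    size = extent / cells  # side length of one cell in metres
    grid = [[0] * cells for _ in range(cells)]
    for row in range(cells):
        for col in range(cells):
            cx = (col + 0.5) * size  # cell centre, x
            cy = (row + 0.5) * size  # cell centre, y
            if any(x0 <= cx <= x1 and y0 <= cy <= y1 for x0, y0, x1, y1 in boxes):
                grid[row][col] = 1
    return grid


# One landmark box on a 40 m x 40 m area split into a 4 x 4 grid.
grid = rasterize_landmarks([(10.0, 10.0, 30.0, 20.0)], extent=40.0, cells=4)
for line in grid:
    print(line)
```

A map of this kind could be stacked with further channels (for example, the agent's current position or the area it has already observed) before being passed to a navigation model; which channels the paper actually uses is not stated in this summary.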

Paper Structure (56 sections, 15 figures, 7 tables)

This paper contains 56 sections, 15 figures, 7 tables.

Introduction
Related Work
Ground-level Datasets
Aerial Datasets
CityNav Dataset
CityFlight Environment
3D Scan Data
Action Space
OpenStreetMap
For the use of GNSS
Implementation Details
Task Definition
Goal Description
Starting Point
Success Criteria
...and 41 more sections

Figures (15)

Figure 1: CityNav is a new aerial navigation dataset consisting of 32,637 human demonstration trajectories across real-world cities.
Figure 2: CityFlight is a 3D environment for flight simulation. Five actions, each mapped to a keyboard key, allow movement and rotation of the UAV. The 3D environment is synchronized with OpenStreetMap. Human annotators are asked to navigate to the specified goal object within the 3D scene.
Figure 3: Distance to goal
Figure 4: Description length
Figure 5: Number of actions
...and 10 more figures
