
TOL: Textual Localization with OpenStreetMap

Youqi Liao, Shuhao Kang, Jingyu Xu, Olaf Wysocki, Yan Xia, Jianping Li, Zhen Dong, Bisheng Yang, Xieyuanli Chen

Abstract

Natural language provides an intuitive way to express spatial intent in geospatial applications. While existing localization methods often rely on dense point cloud maps or high-resolution imagery, OpenStreetMap (OSM) offers a compact and freely available map representation that encodes rich semantic and structural information, making it well suited for large-scale localization. However, text-to-OSM (T2O) localization remains largely unexplored. In this paper, we formulate the T2O global localization task, which aims to estimate accurate 2 degree-of-freedom (DoF) positions in urban environments from textual scene descriptions without relying on geometric observations or GNSS-based initial location. To support the proposed task, we introduce TOL, a large-scale benchmark spanning multiple continents and diverse urban environments. TOL contains approximately 121K textual queries paired with OSM map tiles and covers about 316 km of road trajectories across Boston, Karlsruhe, and Singapore. We further propose TOLoc, a coarse-to-fine localization framework that explicitly models the semantics of surrounding objects and their directional information. In the coarse stage, direction-aware features are extracted from both textual descriptions and OSM tiles to construct global descriptors, which are used to retrieve candidate locations for the query. In the fine stage, the query text and top-1 retrieved tile are jointly processed, where a dedicated alignment module fuses textual descriptor and local map features to regress the 2-DoF pose. Experimental results demonstrate that TOLoc achieves strong localization performance, outperforming the best existing method by 6.53%, 9.93%, and 8.31% at 5m, 10m, and 25m thresholds, respectively, and shows strong generalization to unseen environments. Dataset, code and models will be publicly available at: https://github.com/WHU-USI3DV/TOL.
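The coarse stage described above ranks OSM tiles by the similarity between a global text descriptor and precomputed tile descriptors. A minimal sketch of that retrieval step, assuming descriptors are fixed-length embeddings compared by cosine similarity (the function name `retrieve_top_k`, the embedding dimension, and the toy data are illustrative, not from the paper):

```python
import numpy as np

def retrieve_top_k(query_desc, tile_descs, k=1):
    """Rank OSM tile descriptors by cosine similarity to a query text descriptor.

    query_desc: (C,) global descriptor of the textual query.
    tile_descs: (Z, C) descriptors of the Z tiles in the OSM database.
    Returns the indices of the top-k tiles and all similarity scores.
    """
    q = query_desc / np.linalg.norm(query_desc)
    t = tile_descs / np.linalg.norm(tile_descs, axis=1, keepdims=True)
    sims = t @ q                       # cosine similarity per tile
    return np.argsort(-sims)[:k], sims

# Toy database: 4 tile descriptors in an 8-D embedding space.
rng = np.random.default_rng(0)
tiles = rng.standard_normal((4, 8))
query = tiles[2] + 0.05 * rng.standard_normal(8)  # query near tile 2
top, sims = retrieve_top_k(query, tiles, k=1)
print(top[0])  # → 2
```

In the actual framework the descriptors would come from learned direction-aware encoders trained with contrastive learning; this sketch only shows the retrieval arithmetic once those descriptors exist.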

Paper Structure

This paper contains 38 sections, 20 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: (a) Motivation. Text-to-OSM localization retrieves the most similar OSM tile from the database first, and then estimates the accurate 2-DoF position later. (b) shows the difference with existing methods. Compared with text-to-point-cloud localization methods, OSM data is much lighter in data construction, storage, and updates. Compared with existing text-driven place recognition methods, our approach focuses on global localization with meter-level accuracy, instead of simply retrieving OSM tiles from textual queries. This enables finer-grained spatial understanding and more precise localization beyond tile-level retrieval.
  • Figure 2: Data distribution of the TOL dataset. (a) and (b) show the Boston and Singapore portions of the TOL-N set, respectively. (c) shows the trajectory of the TOL-K360 set in Karlsruhe.
  • Figure 3: Illustration of rasterized OSM tiles.
  • Figure 4: Pipeline of TOLoc. With the query text $\boldsymbol{\mathcal{T}}$ and OSM database $\mathbb{O}=\{\mathcal{O}_j\}_{j=1}^{Z}$, TOLoc performs global localization in two stages. Place recognition stage first learns text–map correspondences via contrastive learning. Pose estimation stage then aligns the text descriptor $\boldsymbol{d}_{\mathcal{T}}$ with the features of the top-1 retrieved map tile $\boldsymbol{F}_O$ to regress the precise location within the local map. TOA module employs self-attention and cross-attention mechanisms to effectively integrate local OSM features with textual information and estimate the 2-DoF position $\boldsymbol{\xi} = (x, y)$.
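The cross-attention fusion sketched in the Figure 4 caption can be illustrated in miniature: a text descriptor acts as the query over flattened map-tile features, and the resulting attention weights yield a soft 2-DoF position as an expectation over cell coordinates. This is a single-head, hypothetical simplification (the names `cross_attention_pose`, the soft-argmax readout, and the toy grid are assumptions, not the paper's TOA module):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_pose(d_text, F_map, coords):
    """Single-head cross-attention pose readout (illustrative simplification).

    d_text: (C,) text descriptor used as the attention query.
    F_map:  (N, C) flattened map-tile features acting as keys.
    coords: (N, 2) (x, y) coordinate of each map cell.
    Returns a soft 2-DoF position: the attention-weighted mean of coords.
    """
    scale = np.sqrt(d_text.shape[-1])
    attn = softmax(F_map @ d_text / scale)   # (N,) weight per map cell
    return attn @ coords                      # expectation over (x, y)

# Toy tile: a 3x3 grid of 4-D features with integer cell coordinates.
H, W, C = 3, 3, 4
rng = np.random.default_rng(1)
F_map = rng.standard_normal((H * W, C))
ys, xs = np.divmod(np.arange(H * W), W)
coords = np.stack([xs, ys], axis=1).astype(float)
d_text = F_map[4]  # descriptor resembling the center cell's feature
xy = cross_attention_pose(d_text, F_map, coords)
```

The real module stacks learned self- and cross-attention layers and regresses the pose with an MLP head; the sketch only conveys how attention lets a textual descriptor select where in the local map the query is grounded.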
  • Figure 5: Recall curves at the 25 m threshold for top-$K$ candidates on the TOL-N and TOL-K360 sets.
  • ...and 3 more figures