
Coarse-to-Fine Monocular Re-Localization in OpenStreetMap via Semantic Alignment

Yuchen Zou, Xiao Hu, Dexing Zhong, Yuqing Tang

TL;DR

This paper proposes a hierarchical search framework with semantic alignment for localization in OpenStreetMap (OSM). It uses the semantic awareness of DINO-ViT to decompose visual elements and relate them to OSM semantics, and it replaces global dense matching with a coarse-to-fine search paradigm.

Abstract

Monocular re-localization plays a crucial role in enabling intelligent agents to achieve human-like perception. However, traditional methods rely on dense maps, which face scalability limitations and privacy risks. OpenStreetMap (OSM), as a lightweight map that protects privacy, offers semantic and geometric information with global scalability. Nonetheless, there are still challenges in using OSM for localization: the inherent cross-modal discrepancies between natural images and OSM, as well as the high computational cost of global map-based localization. In this paper, we propose a hierarchical search framework with semantic alignment for localization in OSM. First, the semantic awareness capability of DINO-ViT is utilised to deconstruct visual elements to establish semantic relationships with OSM. Second, a coarse-to-fine search paradigm is designed to replace global dense matching, enabling efficient progressive refinement. Extensive experiments demonstrate that our method significantly improves both localization accuracy and speed. When trained on a single dataset, the 3° orientation recall of our method even outperforms the 5° recall of state-of-the-art methods.

Paper Structure

This paper contains 18 sections, 8 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Localization in the OSM-derived semantic map by transforming the query image into semantic features.
  • Figure 2: Visualization of street view image features from VGG19, ResNet101, and DINOv2, where the same color indicates shared semantics.
  • Figure 3: Coarse-to-fine semantic localization framework for OSM localization. First, the query image is transformed into semantic features. Then, the sliced OSM tile is converted into a semantic map. A coarse pose is estimated through coarse semantic matching. Finally, based on the uncertainty of the coarse pose, a refined match is returned.
  • Figure 4: Visualization of the coarse-to-fine localization process, including refinement of the neural map transformed from OSM, and refinement of the pose likelihood during matching. The red arrow represents the predicted pose, and the black arrow represents the ground truth pose.
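
The coarse-to-fine idea described in the abstract and in Figure 3 can be illustrated with a toy sketch. This is not the authors' implementation: the score function `toy_score`, the grid size, and the fixed refinement window are all assumptions made for illustration, whereas the paper matches learned semantic features and sizes the refinement by the uncertainty of the coarse pose. The sketch only shows why a coarse pass followed by a local fine pass needs far fewer score evaluations than global dense matching.

```python
from typing import Callable, Tuple

Score = Callable[[int, int], float]


def dense_search(score: Score, width: int, height: int) -> Tuple[int, int]:
    """Baseline: evaluate every cell of the grid (global dense matching)."""
    return max(
        ((x, y) for y in range(height) for x in range(width)),
        key=lambda p: score(*p),
    )


def coarse_to_fine_search(
    score: Score, width: int, height: int, stride: int = 8
) -> Tuple[Tuple[int, int], int]:
    """Return the best cell and the number of cells that were scored."""
    # Coarse stage: score only every `stride`-th cell in each direction.
    coarse = [
        (x, y)
        for y in range(0, height, stride)
        for x in range(0, width, stride)
    ]
    cx, cy = max(coarse, key=lambda p: score(*p))

    # Fine stage: exhaustive search in a fixed window around the coarse
    # estimate (assumption: the paper derives this region from the
    # uncertainty of the coarse pose instead).
    fine = [
        (x, y)
        for y in range(max(0, cy - stride), min(height, cy + stride + 1))
        for x in range(max(0, cx - stride), min(width, cx + stride + 1))
    ]
    best = max(fine, key=lambda p: score(*p))
    return best, len(coarse) + len(fine)


def toy_score(x: int, y: int) -> float:
    """Hypothetical smooth pose likelihood peaked at cell (37, 21)."""
    return -float((x - 37) ** 2 + (y - 21) ** 2)


best, evaluations = coarse_to_fine_search(toy_score, width=64, height=64, stride=8)
print(best, evaluations)
```

On this 64 × 64 toy grid the two-stage search scores 353 cells instead of the 4096 that dense search visits, and both return the same cell. That agreement depends on the score being smooth at the scale of the stride; a sharply peaked likelihood could be missed by the coarse pass.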