MapLocNet: Coarse-to-Fine Feature Registration for Visual Re-Localization in Navigation Maps

Hang Wu, Zhenghao Zhang, Siyuan Lin, Xiangru Mu, Qiang Zhao, Ming Yang, Tong Qin

TL;DR

MapLocNet tackles GPS-denied urban localization by fusing surround-view imagery with navigation maps in an HD-map-free framework. It introduces a hierarchical coarse-to-fine feature registration approach based on a transformer that aligns visual BEV features with map features, producing a 3-DoF pose offset $\boldsymbol{\hat{\xi}}=(\Delta x,\Delta y,\Delta \theta)$. The architecture comprises a BEV Module, a Map U-Net, and a Neural Localization Module, with a two-stage registration and joint supervision via BEV and map segmentation losses. The method achieves state-of-the-art localization accuracy and inference speed on nuScenes and Argoverse, while eliminating reliance on costly HD maps, enabling scalable, real-time localization in challenging urban environments.

Abstract

Robust localization is the cornerstone of autonomous driving, especially in challenging urban environments where GPS signals suffer from multipath errors. Traditional localization approaches rely on high-definition (HD) maps, which consist of precisely annotated landmarks. However, building HD maps is expensive and difficult to scale. Given these limitations, leveraging navigation maps has emerged as a promising low-cost alternative for localization. Current approaches based on navigation maps can achieve highly accurate localization, but their complex matching strategies lead to inference latency that is too high to meet real-time demands. To address these limitations, we propose a novel transformer-based neural re-localization method. Inspired by image registration, our approach performs coarse-to-fine neural feature registration between navigation-map features and visual bird's-eye-view (BEV) features. Our method significantly outperforms the current state-of-the-art OrienterNet on both the nuScenes and Argoverse datasets, improving localization accuracy by nearly 10%/20% and inference speed by 30/16 FPS in the single-view and surround-view input settings, respectively. We highlight that our research presents an HD-map-free localization method for autonomous driving, offering cost-effective, reliable, and scalable performance in challenging driving environments.

Paper Structure (37 sections, 6 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Due to signal occlusion and multipath errors, GPS-based positioning is unreliable in complex urban environments. To address this problem, we propose MapLocNet, which takes surround-view images and navigation maps as input and applies consecutive neural localization modules, based on coarse-to-fine feature registration, to achieve superior localization accuracy in challenging scenarios.
  • Figure 2: The overall architecture of MapLocNet comprises three main modules: the BEV Module, Map U-Net, and Neural Localization Module. Our approach employs a coarse-to-fine feature registration strategy, extracting multi-scale features from both the BEV Decoder and Map Decoder to perform hierarchical feature alignment. Following the initial coarse registration stage, which yields a coarse estimate of the pose offset, we apply a spatial transformation to the high-resolution BEV features to facilitate the subsequent fine registration process. The predictions from both stages are combined to yield the final pose offset estimation result.
  • Figure 3: Visualization of the original navigation map, its rasterized representation, and corresponding surround-view images. Rasterization enhances the expression of topological cues and spatial layout, emphasizing key elements like lane lines and building areas.
  • Figure 4: Visualization of localization results on the nuScenes dataset during day and night. The middle two columns depict high-resolution, low-channel BEV features and map features, respectively. The white dots and bounding boxes in the map features represent the ground-truth (GT) locations and orientations of the BEV features. In the last column, black dots and bounding boxes represent the corrected locations and orientations on the map after applying the offsets predicted by the model.
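
The Figure 2 caption describes a coarse stage that estimates a pose offset, a spatial transformation of the BEV features by that estimate, and a fine stage whose prediction is combined with the coarse one. The source does not say how the two predictions are combined, so the following is only a minimal sketch under one plausible assumption: each 3-DoF offset $(\Delta x,\Delta y,\Delta \theta)$ is treated as a planar rigid transform, and the fine offset is a residual expressed in the frame obtained after the coarse warp. The function names (`se2_matrix`, `se2_params`, `compose_offsets`, `apply_offset`) are hypothetical and are not taken from the paper.

```python
import numpy as np

def se2_matrix(dx, dy, dtheta):
    """Homogeneous 3x3 matrix of a planar rigid transform (dtheta in radians)."""
    c, s = np.cos(dtheta), np.sin(dtheta)
    return np.array([[c, -s, dx],
                     [s,  c, dy],
                     [0.0, 0.0, 1.0]])

def se2_params(T):
    """Recover (dx, dy, dtheta) from a 3x3 planar rigid transform."""
    return float(T[0, 2]), float(T[1, 2]), float(np.arctan2(T[1, 0], T[0, 0]))

def compose_offsets(coarse, fine):
    """Combine a coarse offset with a fine residual.

    Assumption (not stated in the source): the fine offset is expressed in the
    frame reached after applying the coarse offset, so the total transform is
    the matrix product coarse @ fine.
    """
    return se2_params(se2_matrix(*coarse) @ se2_matrix(*fine))

def apply_offset(prior_pose, offset):
    """Correct a prior pose (x, y, yaw) with an offset given in the prior's frame."""
    return se2_params(se2_matrix(*prior_pose) @ se2_matrix(*offset))

# Illustrative values only; they are not results from the paper.
coarse = (2.0, -1.0, np.deg2rad(5.0))   # hypothetical coarse-stage prediction
fine = (0.3, 0.1, np.deg2rad(-1.0))     # hypothetical fine-stage residual
total = compose_offsets(coarse, fine)

prior = (10.0, 5.0, np.pi / 2)          # hypothetical noisy prior pose (e.g. from GPS)
corrected = apply_offset(prior, total)
print("combined offset:", total)
print("corrected pose:", corrected)
```

If the two stages were instead defined in a shared frame with small rotations, simple element-wise addition of the offsets would be a close approximation. The matrix form above keeps the translation consistent when the coarse rotation is not negligible.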