MapLocNet: Coarse-to-Fine Feature Registration for Visual Re-Localization in Navigation Maps

Hang Wu; Zhenghao Zhang; Siyuan Lin; Xiangru Mu; Qiang Zhao; Ming Yang; Tong Qin

TL;DR

MapLocNet tackles GPS-denied urban localization by fusing surround-view imagery with navigation maps in an HD-map-free framework. It introduces a hierarchical coarse-to-fine feature registration approach based on a transformer that aligns visual BEV features with map features, producing a 3-DoF pose offset $\boldsymbol{\hat{\xi}}=(\Delta x,\Delta y,\Delta \theta)$. The architecture comprises a BEV Module, a Map U-Net, and a Neural Localization Module, with a two-stage registration and joint supervision via BEV and map segmentation losses. The method achieves state-of-the-art localization accuracy and inference speed on nuScenes and Argoverse, while eliminating reliance on costly HD maps, enabling scalable, real-time localization in challenging urban environments.
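As a minimal sketch of what a 3-DoF pose offset $(\Delta x,\Delta y,\Delta \theta)$ does, the snippet below applies such an offset to a prior planar pose. The function names `apply_offset` and `wrap_angle` are hypothetical, and the convention that the offset is expressed in the frame of the prior pose is an assumption made for illustration; the summary above does not state which frame the paper uses.

```python
import math


def wrap_angle(angle):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def apply_offset(prior, offset):
    """Compose a prior pose (x, y, theta) with an offset (dx, dy, dtheta).

    Assumption (not from the source): the offset is expressed in the
    local frame of the prior pose, so the translation is rotated by the
    prior heading before being added.
    """
    x, y, theta = prior
    dx, dy, dtheta = offset
    c, s = math.cos(theta), math.sin(theta)
    new_x = x + c * dx - s * dy
    new_y = y + s * dx + c * dy
    return new_x, new_y, wrap_angle(theta + dtheta)


# Example: a prior pose heading along +y, corrected by 1 m forward.
corrected = apply_offset((1.0, 2.0, math.pi / 2), (1.0, 0.0, 0.0))
```

If the offset were instead defined directly in the map frame, the rotation step would be dropped and the translation added as-is.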

Abstract

Robust localization is the cornerstone of autonomous driving, especially in challenging urban environments where GPS signals suffer from multipath errors. Traditional localization approaches rely on high-definition (HD) maps, which consist of precisely annotated landmarks. However, building HD maps is expensive and difficult to scale. Given these limitations, leveraging navigation maps has emerged as a promising low-cost alternative for localization. Current approaches based on navigation maps can achieve highly accurate localization, but their complex matching strategies lead to inference latency that fails to meet real-time demands. To address these limitations, we propose a novel transformer-based neural re-localization method. Inspired by image registration, our approach performs a coarse-to-fine neural feature registration between navigation map features and visual bird's-eye-view (BEV) features. Our method significantly outperforms the current state-of-the-art OrienterNet on both the nuScenes and Argoverse datasets, improving localization accuracy by nearly 10%/20% and inference speed by 30/16 FPS in the single-view and surround-view input settings, respectively. We highlight that our research presents an HD-map-free localization method for autonomous driving, offering cost-effective, reliable, and scalable performance in challenging driving environments.

Paper Structure (37 sections, 6 equations, 4 figures, 4 tables)

This paper contains 37 sections, 6 equations, 4 figures, 4 tables.

Introduction
Literature Review
Localization Using Navigation Maps
BEV Representation for Visual Localization
Image Registration
End-to-end Localization Neural Networks
Methodology
Problem Formulation and System Overview
Map Processing
Map Rasterization
Segmentation Labels
BEV Module
Map U-Net
Neural Localization Module
Loss Function
...and 22 more sections

Figures (4)

Figure 1: Due to signal occlusion and multipath errors, the GPS-based positioning is unreliable in complex urban environments. To address this problem, we propose MapLocNet, which leverages surround-view images and navigation maps, utilizing consecutive neural localization modules based on coarse-to-fine feature registration principles to achieve superior localization accuracy in challenging scenarios.
Figure 2: The overall architecture of MapLocNet comprises three main modules: the BEV Module, Map U-Net, and Neural Localization Module. Our approach employs a coarse-to-fine feature registration strategy, extracting multi-scale features from both the BEV Decoder and Map Decoder to perform hierarchical feature alignment. Following the initial coarse registration stage, which yields a coarse estimate of the pose offset, we apply a spatial transformation to the high-resolution BEV features to facilitate the subsequent fine registration process. The predictions from both stages are combined to yield the final pose offset estimation result.
Figure 3: Visualization of the original navigation map, its rasterized representation, and corresponding surround-view images. Rasterization enhances the expression of topological cues and spatial layout, emphasizing key elements like lane lines and building areas.
Figure 4: Visualization of localization results in the nuScenes dataset during day and night. The middle two columns depict high-resolution, low-channel BEV features and map features, respectively. The white dots and bounding boxes in the map features represent the GT locations and orientations of the BEV features. In the last column, black dots and bounding boxes represent the corrected locations and orientations on the map after applying the offsets predicted by the model.
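The coarse-to-fine flow described in the Figure 2 caption (coarse estimate, warp of the high-resolution BEV features, fine estimate, combination of both predictions) can be illustrated with a toy sketch. Everything here is an assumption for illustration: the registration step is a brute-force correlation search over integer translations rather than the paper's transformer-based module, rotation is ignored, warping uses a circular shift, and the two stages are combined by simple addition. The names `best_shift`, `downsample`, and `coarse_to_fine` are hypothetical.

```python
import numpy as np


def best_shift(bev, map_feat, radius):
    """Return the integer (dy, dx) within +/- radius that maximizes the
    correlation between the circularly shifted BEV grid and the map grid."""
    best, best_score = (0, 0), -np.inf
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            shifted = np.roll(bev, (dy, dx), axis=(0, 1))
            score = float((shifted * map_feat).sum())
            if score > best_score:
                best, best_score = (dy, dx), score
    return best


def downsample(grid, factor):
    """Average-pool a 2D grid by an integer factor (sizes must divide evenly)."""
    h, w = grid.shape
    return grid.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))


def coarse_to_fine(bev, map_feat, factor=4, coarse_radius=4, fine_radius=2):
    """Two-stage translation estimate: low-resolution search, warp, refine."""
    # Coarse stage: search on low-resolution grids, then scale to pixels.
    cdy, cdx = best_shift(
        downsample(bev, factor), downsample(map_feat, factor), coarse_radius
    )
    coarse = (cdy * factor, cdx * factor)
    # Warp the high-resolution BEV grid by the coarse estimate.
    warped = np.roll(bev, coarse, axis=(0, 1))
    # Fine stage: small residual search at full resolution.
    fdy, fdx = best_shift(warped, map_feat, fine_radius)
    # Combine both stages (simple addition is an assumption here).
    return coarse[0] + fdy, coarse[1] + fdx


# Toy data: one square "map" region and a BEV grid displaced from it.
map_feat = np.zeros((32, 32))
map_feat[12:20, 12:20] = 1.0
bev = np.roll(map_feat, (-9, 7), axis=(0, 1))
offset = coarse_to_fine(bev, map_feat)
```

The point of the two stages is that the coarse search covers a large displacement cheaply on a small grid, while the fine search only has to resolve a residual of a few pixels at full resolution.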
