
L2RSI: Cross-view LiDAR-based Place Recognition for Large-scale Urban Scenes via Remote Sensing Imagery

Ziwei Shi, Xiaoran Zhang, Wenjing Xu, Yan Xia, Yu Zang, Siqi Shen, Cheng Wang

TL;DR

This work tackles large-scale LiDAR-based place recognition without relying on costly pre-built 3D maps by aligning LiDAR BEV submaps with high-resolution remote sensing imagery through semantic contrastive learning. It introduces the LiRSI-XA dataset and the L2RSI framework, which is augmented by Spatial-Temporal Particle Estimation (STPE) to fuse temporal information and refine localization over sequences. The approach achieves strong cross-view, cross-modal localization performance over a retrieval range of $100\,\mathrm{km}^2$ of urban area, with a Recall@1 of $83.27\%$ and robust generalization to new scenes without fine-tuning. The method has practical implications for scalable, cost-effective autonomous navigation: it combines semantic alignment with a probabilistic spatio-temporal refinement that runs in real time.

Abstract

We tackle the challenge of LiDAR-based place recognition, which traditionally depends on costly and time-consuming prior 3D maps. To overcome this, we first construct the LiRSI-XA dataset, which encompasses approximately $110{,}000$ remote sensing submaps and $13{,}000$ LiDAR point cloud submaps captured in urban scenes, and propose a novel method, L2RSI, for cross-view LiDAR place recognition using high-resolution Remote Sensing Imagery. This approach enables large-scale localization at a reduced cost by leveraging readily available overhead images as map proxies. L2RSI addresses the dual challenges of cross-view and cross-modal place recognition by learning feature alignment between point cloud submaps and remote sensing submaps in the semantic domain. Additionally, we introduce a novel probability propagation method based on particle estimation to refine position predictions, effectively leveraging temporal and spatial information. This approach enables large-scale retrieval and cross-scene generalization without fine-tuning. Extensive experiments on LiRSI-XA demonstrate that, within a $100\,\mathrm{km}^2$ retrieval range, L2RSI accurately localizes $83.27\%$ of point cloud submaps within a $30\,\mathrm{m}$ radius for the top-$1$ retrieved location. Our project page is publicly available at https://shizw695.github.io/L2RSI/.
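
The headline metric in the abstract (the share of queries whose top-$1$ retrieved location lies within $30\,\mathrm{m}$ of the true position) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `recall_at_k`, the planar metre coordinates, the strict `<` threshold, and the toy data are all assumptions made for illustration.

```python
import math

def recall_at_k(query_xy, ranked_db_xy, k=1, radius_m=30.0):
    """Fraction of queries with at least one of the top-k retrieved
    locations closer than radius_m (hypothetical helper, planar metres)."""
    if not query_xy:
        return 0.0
    hits = 0
    for (qx, qy), ranked in zip(query_xy, ranked_db_xy):
        # A query counts as localized if any top-k candidate is near enough.
        if any(math.hypot(qx - x, qy - y) < radius_m for x, y in ranked[:k]):
            hits += 1
    return hits / len(query_xy)

# Toy data (made up): true query positions and, per query, the database
# locations ordered from best to worst descriptor match.
queries = [(0.0, 0.0), (100.0, 100.0), (500.0, 0.0)]
retrieved = [
    [(10.0, 10.0), (300.0, 300.0)],   # top-1 is about 14 m away: hit at k=1
    [(200.0, 200.0), (110.0, 95.0)],  # top-1 misses, top-2 is about 11 m away
    [(900.0, 0.0), (700.0, 0.0)],     # both candidates are far away: miss
]

r1 = recall_at_k(queries, retrieved, k=1)  # 1 of 3 queries localized
r2 = recall_at_k(queries, retrieved, k=2)  # 2 of 3 queries localized
```

With this definition, enlarging `k` or `radius_m` can only keep the recall the same or raise it, which is why results at different top-$k$ values and radii are not directly comparable.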

Paper Structure

This paper contains 25 sections, 11 equations, 7 figures, 12 tables.

Figures (7)

  • Figure 1: (Left) We propose L2RSI for cross-view LiDAR-based place recognition without a pre-built 3D map. Given a point cloud submap representing the surroundings of the robot and the absolute orientation provided by a LiDAR and a magnetometer, L2RSI provides the most probable location in a large-scale city using high-resolution remote sensing imagery. (Right) Place recognition performance in different retrieval ranges. Notably, L2RSI achieves a recall ($<30\,\mathrm{m}$) of over $80\%$ at Top-1 retrieval within a range of $100\,\mathrm{km}^2$.
  • Figure 2: Overview of the proposed L2RSI. It consists of three modules: the data preprocessing module (left), the training framework for Global Descriptor Extraction network (middle) and the inference framework (right).
  • Figure 3: Dataset configuration. In LiRSI-XA (\ref{fig:XAdata}), different colored rectangles highlight test sets from databases of varying sizes, all sharing the same query trajectory (green). The training trajectory (light blue) and the discarded trajectory at the boundary (red) are also annotated. In LiRSI-Oxford (\ref{fig:Oxforddata}), the green trajectory serves as the query for the test set, while the entire remote sensing imagery serves as the database.
  • Figure 4: Ablation study on the number of queries in the sequence ($L$) and the number of particles for each query ($K$). Solid dots represent the optimal performance for each sequence length.
  • Figure 5: Visualization of retrieval results on LiRSI-XA (\ref{fig:XA_recall_visual}) and LiRSI-Oxford (\ref{fig:Oxford_recall_visual}). Points closer to red indicate that more retrieval results are required to achieve correct retrieval at that location. (I) shows the results without STPE, and (II) shows the results with STPE.
  • ...and 2 more figures