RING#: PR-by-PE Global Localization with Roto-translation Equivariant Gram Learning

Sha Lu, Xuecheng Xu, Yuxuan Wu, Haojian Lu, Xieyuanli Chen, Rong Xiong, Yue Wang

TL;DR

This work proposes RING#, an end-to-end PR-by-PE localization network that operates in the bird's-eye-view (BEV) space, compatible with both vision and LiDAR sensors, and incorporates a novel design that learns two equivariant representations from BEV features, enabling globally convergent and computationally efficient PE.

Abstract

Global localization using onboard perception sensors, such as cameras and LiDARs, is crucial in autonomous driving and robotics applications when GPS signals are unreliable. Most approaches achieve global localization by sequential place recognition (PR) and pose estimation (PE). Some methods train separate models for each task, while others employ a single model with dual heads, trained jointly with separate task-specific losses. However, the accuracy of localization heavily depends on the success of place recognition, which often fails in scenarios with significant changes in viewpoint or environmental appearance. Consequently, this renders the final pose estimation of localization ineffective. To address this, we introduce a new paradigm, PR-by-PE localization, which bypasses the need for separate place recognition by directly deriving it from pose estimation. We propose RING#, an end-to-end PR-by-PE localization network that operates in the bird's-eye-view (BEV) space, compatible with both vision and LiDAR sensors. RING# incorporates a novel design that learns two equivariant representations from BEV features, enabling globally convergent and computationally efficient pose estimation. Comprehensive experiments on the NCLT and Oxford datasets show that RING# outperforms state-of-the-art methods in both vision and LiDAR modalities, validating the effectiveness of the proposed approach. The code will be publicly released.
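
One way to read "deriving place recognition from pose estimation" is sketched below. It is a toy illustration, not the RING# network: it assumes that each place is summarized by a 1D circular descriptor, that pose estimation reduces to finding the best circular shift, and that the peak matching score doubles as the place similarity. The names `pose_and_score` and `localize` are hypothetical.

```python
import numpy as np

def pose_and_score(query, candidate):
    """Toy pose estimation: best circular shift of `candidate` onto `query`,
    plus the normalized peak correlation (assumed here to act as similarity)."""
    corr = np.fft.ifft(np.fft.fft(query) * np.conj(np.fft.fft(candidate))).real
    shift = int(np.argmax(corr))
    score = corr[shift] / (np.linalg.norm(query) * np.linalg.norm(candidate))
    return shift, float(score)

def localize(query, database):
    """Run pose estimation against every map candidate and keep the best one.
    Place recognition falls out of the pose-estimation scores; there is no
    separate retrieval descriptor."""
    results = [pose_and_score(query, c) for c in database]
    best = max(range(len(database)), key=lambda i: results[i][1])
    shift, score = results[best]
    return best, shift, score

# Synthetic data: the query is map entry 3, shifted by 11 bins, with mild noise.
rng = np.random.default_rng(0)
database = [rng.standard_normal(128) for _ in range(5)]
query = np.roll(database[3], 11) + 0.05 * rng.standard_normal(128)

place, shift, score = localize(query, database)
print(place, shift, round(score, 3))
```

In this reading, a retrieval failure and a pose failure are the same event, which is the coupling the PR-then-PE pipeline lacks.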

Paper Structure

This paper contains 39 sections, 4 theorems, 38 equations, 21 figures, 9 tables.

Key Result

Lemma 1

$\phi_h(f(x))$ is translation equivariant.

Figures (21)

  • Figure 1: Comparison of global localization paradigms. (a) The PR-then-PE localization paradigm treats place recognition and pose estimation as upstream and downstream tasks, either handled by two independent models (a.1) or jointly learned within a single model (a.2). (b) We introduce a novel paradigm, PR-by-PE localization, which leverages pose estimation to derive place recognition in a single model.
  • Figure 2: PR-by-PE localization framework. Given a raw sensor observation, we first encode it into BEV features. From the BEV features, we construct two equivariant representations that decouple pose estimation into rotation estimation and translation estimation.
  • Figure 3: Overview of the PR-by-PE localization framework RING#. 1. Our BEV generation module converts inputs from multi-view images $\mathcal{I}$ or a LiDAR point cloud $P$ into BEV features $B$. 2. Using the Radon Transform (RT), a Convolutional Neural Network (CNN), and the Fourier Transform (FT), the rotation branch transforms $B$ into rotation-equivariant and translation-invariant representations $A$ and then uses 1D cross-correlation to estimate the relative rotation $\hat{\theta}$. 3. The translation branch compensates for the relative rotation $\theta$, which equals the ground-truth rotation $\theta^{*}$ during training and the rotation $\hat{\theta}$ estimated by the rotation branch during inference, and uses a CNN to yield rotation-invariant and translation-equivariant representations $\widetilde{B}_t$. A subsequent 2D cross-correlation determines the relative translation $\hat{x}, \hat{y}$. RING# is supervised by poses only, in an end-to-end manner.
  • Figure 4: Top-1 retrieved matches for protocol 1 on the NCLT dataset. (a) NetVLAD [arandjelovic2016netvlad]. (b) Patch-NetVLAD [hausler2021patch]. (c) AnyLoc [keetha2023anyloc]. (d) SFRS [ge2020self]. (e) Exhaustive SS [detone2018superpoint, sarlin2020superglue]. (f) BEV-NetVLAD-MLP. (g) vDiSCO [xu2023leveraging]. (h) RING#-V (Ours). (i) OverlapTransformer [ma2022overlaptransformer]. (j) LCDNet [cattaneo2022lcdnet]. (k) DiSCO [xu2021disco]. (l) RING [lu2022one]. (m) RING++ [xu2023ring++]. (n) EgoNN [komorowski2021egonn]. (o) RING#-L (Ours). The black line represents the trajectory, green lines represent correct retrieval matches, and red lines represent wrong retrieval matches.
  • Figure 5: Qualitative vision examples of queries and their top-1 retrieved matches on the NCLT dataset. A red rectangle $\square$ marks a wrong retrieval result and a green rectangle $\square$ marks a correct one.
  • ...and 16 more figures
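
The two matching steps named in the Figure 3 caption (1D cross-correlation for rotation, 2D cross-correlation for translation) can be sketched on synthetic data. This is a minimal sketch under strong assumptions, not the RING# implementation: the learned CNNs are omitted, a random array stands in for the Radon-transformed BEV features, a rotation is modeled as a circular shift along the angle axis, a translation as a per-angle circular shift along the offset axis, and the rotation is assumed to be already compensated before the translation step. All function and variable names are hypothetical.

```python
import numpy as np

def estimate_rotation(sino_query, sino_map):
    """Rotation branch (toy): the FFT magnitude along the offset axis discards
    per-angle shifts (translation), then a 1D circular cross-correlation along
    the angle axis finds the angular shift."""
    a_q = np.abs(np.fft.fft(sino_query, axis=1))
    a_m = np.abs(np.fft.fft(sino_map, axis=1))
    # corr[k] = sum_{i,j} a_q[(i + k) % n, j] * a_m[i, j]
    spec = np.fft.fft(a_q, axis=0) * np.conj(np.fft.fft(a_m, axis=0))
    corr = np.fft.ifft(spec, axis=0).real.sum(axis=1)
    return int(np.argmax(corr))

def estimate_translation(bev_query, bev_map):
    """Translation branch (toy): 2D circular cross-correlation via the FFT;
    the peak location is the (row, column) shift."""
    spec = np.fft.fft2(bev_query) * np.conj(np.fft.fft2(bev_map))
    corr = np.fft.ifft2(spec).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    return int(dy), int(dx)

rng = np.random.default_rng(0)

# Rotation: shift each angle row by a random offset (stand-in for translation),
# then roll the angle axis by 17 bins (stand-in for rotation).
sino_map = rng.random((90, 64))
row_shifts = rng.integers(0, 64, size=90)
sino_query = np.stack([np.roll(r, s) for r, s in zip(sino_map, row_shifts)])
sino_query = np.roll(sino_query, 17, axis=0)
est_rot = estimate_rotation(sino_query, sino_map)

# Translation: shift a rotation-compensated grid by 5 rows and 60 columns
# (60 is the circular equivalent of -4 on a 64-wide grid).
bev_map = rng.random((64, 64))
bev_query = np.roll(bev_map, (5, 60), axis=(0, 1))
est_shift = estimate_translation(bev_query, bev_map)

print(est_rot, est_shift)
```

Both searches are exhaustive over the discretized pose space, so the result does not depend on an initial guess; this is one plausible reading of "globally convergent" in the abstract. The FFT keeps each search at O(n log n) cost in the number of grid cells.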

Theorems & Definitions (10)

  • Definition 1: Equivariance
  • Definition 2: Invariance
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Theorem 1
  • proof
  • Theorem 2
  • proof
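
The statements of Definitions 1 and 2 are not reproduced in this section. Assuming the standard notions (a map $F$ is equivariant to a transformation $T$ if $F(Tx) = T'F(x)$ for a corresponding transformation $T'$, and invariant if $F(Tx) = F(x)$), the sketch below checks both properties numerically for circular shifts. Circular convolution and the FFT magnitude are chosen here as textbook examples; they are not the maps $\phi_h$ or $f$ from Lemma 1.

```python
import numpy as np

def shift(x, t):
    """Circular translation of a 1D signal by t samples."""
    return np.roll(x, t)

def circ_conv(x, kernel):
    """Circular convolution; equivariant to circular shifts."""
    return np.fft.ifft(np.fft.fft(x) * np.fft.fft(kernel, n=x.size)).real

def spectrum_magnitude(x):
    """FFT magnitude; invariant to circular shifts."""
    return np.abs(np.fft.fft(x))

rng = np.random.default_rng(0)
x = rng.standard_normal(32)
kernel = rng.standard_normal(5)
t = 7

# Equivariance: shifting the input shifts the output by the same amount.
equivariant_ok = np.allclose(circ_conv(shift(x, t), kernel),
                             shift(circ_conv(x, kernel), t))

# Invariance: shifting the input leaves the output unchanged.
invariant_ok = np.allclose(spectrum_magnitude(shift(x, t)),
                           spectrum_magnitude(x))

print(equivariant_ok, invariant_ok)
```

An equivariant representation keeps the transformation recoverable (it can be read off by cross-correlation), whereas an invariant one discards it. This is why the Figure 3 pipeline asks for equivariance to the quantity each branch estimates and invariance to the other.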