BEVDiffLoc: End-to-End LiDAR Global Localization in BEV View based on Diffusion Model

Ziyue Wang, Chenghao Shi, Neng Wang, Qinghua Yu, Xieyuanli Chen, Huimin Lu

TL;DR

BEVDiffLoc performs end-to-end LiDAR global localization on Bird's-Eye-View (BEV) images, which avoids storing an extensive map database while improving robustness. It formulates pose estimation as conditional generation: a diffusion model iteratively denoises a pose, conditioned on features extracted by a Maximum Feature Aggregation (MFA) module and a Vision Transformer (ViT) that are designed to stay robust under large rotational view changes. A BEV-specific data augmentation strategy increases training diversity by stitching multiple frames into a local map and generating new observations from random positions and orientations. Training minimizes an $L_1$ loss on the noise predicted by the diffusion model, and experiments on the Oxford Radar RobotCar and NCLT datasets show that BEVDiffLoc outperforms the baseline methods. The code is open source.
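The $L_1$ noise-prediction objective mentioned in the summary can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the linear variance schedule, the 6-dimensional pose vector, the step count, and the `toy_denoiser` placeholder are all assumptions made only for illustration.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule (an assumed choice, not taken from the paper)."""
    return np.linspace(beta_start, beta_end, num_steps)

def q_sample(x0, t, alphas_cumprod, noise):
    """Forward diffusion: corrupt clean poses x0 up to per-sample timestep t."""
    a_bar = alphas_cumprod[t][:, None]  # shape (batch, 1)
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * noise

def l1_noise_loss(predicted_noise, true_noise):
    """Mean absolute error between predicted and injected noise."""
    return float(np.mean(np.abs(predicted_noise - true_noise)))

def toy_denoiser(x_t, t, condition):
    """Hypothetical stand-in for the feature-conditioned denoising network."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
num_steps = 100
alphas_cumprod = np.cumprod(1.0 - linear_beta_schedule(num_steps))

batch, pose_dim, feat_dim = 4, 6, 16             # all sizes are assumptions
x0 = rng.normal(size=(batch, pose_dim))          # ground-truth poses
condition = rng.normal(size=(batch, feat_dim))   # stand-in for BEV features
t = rng.integers(0, num_steps, size=batch)       # random timestep per sample
noise = rng.normal(size=x0.shape)

x_t = q_sample(x0, t, alphas_cumprod, noise)
loss = l1_noise_loss(toy_denoiser(x_t, t, condition), noise)
```

In an actual training loop the placeholder would be replaced by a learned network, and `loss` would be backpropagated through it.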

Abstract

Localization is one of the core parts of modern robotics. Classic localization methods typically follow the retrieve-then-register paradigm and have achieved remarkable success. Recently, end-to-end localization approaches have emerged with distinct advantages, including a streamlined system architecture and no need to store extensive map data. Although these methods have demonstrated promising results, they still face limitations in robustness and accuracy. The Bird's-Eye-View (BEV) image is one of the most widely adopted data representations in autonomous driving. It significantly reduces data complexity while preserving spatial structure and scale consistency, making it an ideal representation for localization tasks. However, research on BEV-based end-to-end localization remains notably insufficient. To fill this gap, we propose BEVDiffLoc, a novel framework that formulates LiDAR localization as conditional generation of poses. Leveraging the properties of BEV, we first introduce a specific data augmentation method to significantly enhance the diversity of input data. Then, the Maximum Feature Aggregation Module and Vision Transformer are employed to learn features that remain robust to significant rotational view variations. Finally, we incorporate a diffusion model that iteratively refines the learned features to recover the absolute pose. Extensive experiments on the Oxford Radar RobotCar and NCLT datasets demonstrate that BEVDiffLoc outperforms the baseline methods. Our code is available at https://github.com/nubot-nudt/BEVDiffLoc.
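To make the BEV representation concrete, the following sketch rasterizes a LiDAR point cloud into a top-down image. The function name `points_to_bev`, the point-density encoding, the 50 m range, and the 0.5 m cell size are assumptions chosen for illustration; the abstract does not specify how the BEV images are built.

```python
import numpy as np

def points_to_bev(points, grid_range=50.0, resolution=0.5):
    """Rasterize an (N, 3) point cloud into a square BEV density image.

    Each cell counts the points whose (x, y) fall inside it; z is ignored.
    """
    size = int(round(2 * grid_range / resolution))
    # Keep only points inside the square (-grid_range, grid_range) in x and y.
    mask = (np.abs(points[:, 0]) < grid_range) & (np.abs(points[:, 1]) < grid_range)
    pts = points[mask]
    cols = np.floor((pts[:, 0] + grid_range) / resolution).astype(int)
    rows = np.floor((pts[:, 1] + grid_range) / resolution).astype(int)
    # Guard against floating-point rounding at the upper boundary.
    cols = np.clip(cols, 0, size - 1)
    rows = np.clip(rows, 0, size - 1)
    bev = np.zeros((size, size), dtype=np.float32)
    np.add.at(bev, (rows, cols), 1.0)  # unbuffered add handles repeated cells
    return bev

points = np.array([
    [0.1, 0.1, 0.0],
    [0.2, 0.3, 1.0],     # falls in the same cell as the first point
    [-49.9, 49.9, 0.5],  # near a corner of the grid
    [80.0, 0.0, 0.0],    # outside the range, dropped
])
bev = points_to_bev(points)
```

Because every cell covers a fixed metric area, distances in the image map directly to distances on the ground, which is the scale consistency the abstract refers to.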

Paper Structure

This paper contains 15 sections, 7 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Comparison between classic localization approaches and end-to-end localization methods. Classic localization approaches typically follow a two-step process: place recognition followed by pose estimation. This approach demands substantial memory to store a map database. Furthermore, its reliance on multiple cascaded modules makes optimization challenging. In contrast, end-to-end methods implicitly encode environmental information into network parameters, resulting in a simpler structure and reduced storage requirements.
  • Figure 2: The pipeline of the BEVDiffLoc framework. First, it merges $M$ consecutive point cloud frames with an interval of $S$ frames and generates new BEV images from random positions with random orientations. Then, the MFA and ViT modules extract robust features $\mathsf{F}^i_P$, addressing the challenges posed by varying orientations. Pose estimation is modeled as a denoising task, which ultimately yields the predicted poses.
  • Figure 3: Localization results on 17-14-03-00 in the Oxford RobotCar dataset. Successful localizations are marked in red, while failed localizations are marked in black. The star denotes the start frame. The caption of each subfigure displays the SR.
  • Figure 4: Localization results on 2012-02-19 in the NCLT dataset. We use a heatmap with a range of 5 to represent orientation errors. The star denotes the start frame. The caption of each subfigure displays the ${e}_t$ and ${e}_y$.
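
The frame-merging and resampling step described in the Figure 2 caption can be sketched in 2D as follows. This is one possible reading, not the authors' code: it assumes that "an interval of $S$ frames" means taking every $S$-th frame, that poses are planar $(x, y, \text{yaw})$ tuples, and that virtual viewpoints are drawn uniformly within a hypothetical `max_shift` of a reference pose. All function names are illustrative.

```python
import numpy as np

def se2_matrix(x, y, yaw):
    """3x3 homogeneous transform for a planar pose."""
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s, x], [s, c, y], [0.0, 0.0, 1.0]])

def transform_xy(T, xy):
    """Apply a 3x3 homogeneous transform to (N, 2) points."""
    return xy @ T[:2, :2].T + T[:2, 2]

def build_local_map(scans, poses, start, M, S):
    """Stitch M scans, taken every S frames from `start`, in world coordinates."""
    idx = [start + k * S for k in range(M)]
    return np.vstack([transform_xy(se2_matrix(*poses[i]), scans[i]) for i in idx])

def sample_virtual_scan(local_map, center_pose, rng, max_shift=2.0):
    """Re-express the local map in a randomly placed and rotated sensor frame."""
    dx, dy = rng.uniform(-max_shift, max_shift, size=2)
    yaw = rng.uniform(-np.pi, np.pi)
    new_pose = (center_pose[0] + dx, center_pose[1] + dy, yaw)
    T_inv = np.linalg.inv(se2_matrix(*new_pose))  # world -> virtual sensor
    return transform_xy(T_inv, local_map), new_pose

rng = np.random.default_rng(0)
poses = [(float(i), 0.0, 0.0) for i in range(10)]   # toy straight-line trajectory
scans = [rng.uniform(-20, 20, size=(50, 2)) for _ in poses]

local_map = build_local_map(scans, poses, start=0, M=3, S=2)  # frames 0, 2, 4
virtual_scan, virtual_pose = sample_virtual_scan(local_map, poses[2], rng)
```

Under these assumptions, `virtual_scan` would then be rasterized into a BEV image and `virtual_pose` would serve as its ground-truth label, so one stretch of trajectory yields many distinct training samples.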