BEV-SLD: Self-Supervised Scene Landmark Detection for Global Localization with LiDAR Bird's-Eye View Images

David Skuddis, Vincent Ress, Wei Zhang, Vincent Ofosu Nyako, Norbert Haala

Abstract

We present BEV-SLD, a LiDAR global localization method building on the Scene Landmark Detection (SLD) concept. Unlike scene-agnostic pipelines, our self-supervised approach leverages bird's-eye-view (BEV) images to discover scene-specific patterns at a prescribed spatial density and treat them as landmarks. A consistency loss aligns learnable global landmark coordinates with per-frame heatmaps, yielding consistent landmark detections across the scene. Across campus, industrial, and forest environments, BEV-SLD delivers robust localization and achieves strong performance compared to state-of-the-art methods.
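As an illustration of the BEV input mentioned above, the following is a minimal sketch of how a local LiDAR point cloud could be rasterized into a bird's-eye-view density image. It is not the authors' implementation: the function name `bev_density_image`, the `extent` and `resolution` defaults, and the max-normalization are assumptions chosen for illustration only.

```python
import numpy as np


def bev_density_image(points, extent=50.0, resolution=0.5):
    """Rasterize an (N, 3) point cloud into a square BEV density image.

    extent:     half side length of the covered area in metres (assumed value)
    resolution: cell size in metres (assumed value)
    """
    points = np.asarray(points, dtype=float)
    size = int(round(2 * extent / resolution))

    # Map x/y coordinates in metres to integer cell indices, with the
    # sensor origin at the image centre. The z coordinate is ignored.
    cols = np.floor((points[:, 0] + extent) / resolution).astype(int)
    rows = np.floor((points[:, 1] + extent) / resolution).astype(int)

    # Discard points that fall outside the covered area.
    inside = (cols >= 0) & (cols < size) & (rows >= 0) & (rows < size)

    image = np.zeros((size, size), dtype=np.float32)
    # np.add.at accumulates repeated indices, giving per-cell point counts.
    np.add.at(image, (rows[inside], cols[inside]), 1.0)

    # Normalize counts to [0, 1]; an empty image is returned unchanged.
    peak = image.max()
    return image / peak if peak > 0 else image
```

Calling `bev_density_image(cloud)` on an `(N, 3)` array returns a `200 x 200` image with the default (assumed) parameters. A real pipeline might instead use log-scaled counts or a different extent, depending on the sensor.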

Paper Structure (22 sections, 8 equations, 10 figures, 4 tables)

Figures (10)

  • Figure 1: Scene landmark-based localization on the MCD dataset [mcd_dataset]. Left: BEV density image from a local point cloud with predicted landmark locations. Right: Scene landmarks shown as blue squares. Lines indicate correctly predicted correspondences, and the red arrow marks the estimated pose. Real-time localization requires only a 20 MB representation consisting of the network and landmark list.
  • Figure 2: Overview of the proposed landmark learning framework (left) and the localization pipeline at inference time (right).
  • Figure 3: Landmarks at initialization and after joint learning of landmark positions and detection in a factory floor environment. Filled orange circles mark landmark positions at initialization; unfilled green circles mark the positions after training. Orange lines connect corresponding landmarks. The background point cloud is shown for visualization only.
  • Figure 4: Network architecture. Res. Blocks denote residual blocks [res_block]; Down Blocks consist of max pooling followed by a residual block; Conv. Blocks consist of a convolution followed by layer normalization and an activation function.
  • Figure 5: Qualitative comparison of BEVPlace++ [luo2024bevplaceplusplus] trained on KITTI (K), LightLoc [li2025lightloc], KISS-Matcher [lim2024kiss], PosePN++ [YU2022108685], and the proposed method. Top left: reference trajectory (ntu_day_01, orange) and ground-truth trajectory of the test sequence (ntu_day_10, green). Other panels: inlier poses (errors $<2$ m and $<5^\circ$) for each method. Our method achieves broader inlier pose coverage than the baselines.
  • ...and 5 more figures
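The Figure 1 caption describes a pose estimated from correspondences between landmarks detected in the BEV image and scene landmarks. One common way to recover a planar pose from such 2D correspondences is a least-squares rigid alignment (Kabsch method), sketched below. This is an assumption for illustration, not a description of the solver used in the paper; the function name `estimate_pose_2d` is hypothetical, and the sketch assumes all correspondences are correct, so in practice it would typically be wrapped in an outlier-rejection scheme.

```python
import numpy as np


def estimate_pose_2d(local_xy, global_xy):
    """Least-squares 2D rigid transform between matched point sets.

    local_xy:  (N, 2) landmark positions detected in the sensor frame
    global_xy: (N, 2) matching scene landmark positions in the map frame
    Returns (R, t) such that global_xy is approximately R @ local + t.
    """
    local_xy = np.asarray(local_xy, dtype=float)
    global_xy = np.asarray(global_xy, dtype=float)

    # Centre both point sets on their centroids.
    mu_l = local_xy.mean(axis=0)
    mu_g = global_xy.mean(axis=0)

    # 2x2 cross-covariance between centred local and global points.
    H = (local_xy - mu_l).T @ (global_xy - mu_g)
    U, _, Vt = np.linalg.svd(H)

    # Flip the last axis if needed so that R is a proper rotation
    # (determinant +1) rather than a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, d]) @ U.T
    t = mu_g - R @ mu_l
    return R, t
```

The heading of the estimated pose can then be read off as `np.arctan2(R[1, 0], R[0, 0])`, and `t` gives the sensor position in the map frame.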