LiDAR Registration with Visual Foundation Models

Niclas Vödisch, Giovanni Cioffi, Marco Cannici, Wolfram Burgard, Davide Scaramuzza

TL;DR

The paper tackles long-term LiDAR-to-map registration under environmental changes and domain shifts by using DINOv2 features extracted from surround-view images as point descriptors. These descriptors are attached to both LiDAR scan points and map points via point-to-pixel projection and then matched by cosine similarity. The resulting correspondences feed a global RANSAC-based coarse alignment followed by ICP refinement. The approach requires no domain-specific retraining and is agnostic to point cloud density. On the NCLT and Oxford Radar RobotCar datasets it outperforms the best of several diverse baselines by +24.8 and +17.3 percentage points in registration recall, respectively, and it remains robust to seasonal and long-term environmental changes. The work provides a public benchmark and code to spur further research on long-term map-based localization for mobile robots, with future directions including direct visual-to-point projections and richer fusion of semantics and geometry.
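
The descriptor-attachment and matching steps summarized above can be sketched in a few lines. This is a minimal sketch, not the authors' implementation. It assumes a pinhole camera, points already transformed into the camera frame, and a precomputed grid of patch features (for example DINOv2 patch tokens, one per `patch` x `patch` pixel block) that is looked up by nearest patch. The names `attach_descriptors` and `match_cosine` are hypothetical, and plain nearest-neighbor matching on cosine similarity stands in for whatever matching rule the paper actually uses.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]
Desc = List[float]


def project_to_pixel(p: Point, fx: float, fy: float, cx: float, cy: float) -> Optional[Tuple[float, float]]:
    """Pinhole projection of a camera-frame point; None if it lies behind the camera."""
    x, y, z = p
    if z <= 0.0:
        return None
    return fx * x / z + cx, fy * y / z + cy


def attach_descriptors(points_cam: Sequence[Point], feat_map: Sequence[Sequence[Desc]],
                       patch: int, fx: float, fy: float, cx: float, cy: float) -> List[Optional[Desc]]:
    """Give each point the feature of the image patch it projects into.

    feat_map[r][c] is the descriptor of patch row r, column c (assumed patch-grid layout).
    Points behind the camera or outside the image get None.
    """
    rows, cols = len(feat_map), len(feat_map[0])
    descs: List[Optional[Desc]] = []
    for p in points_cam:
        uv = project_to_pixel(p, fx, fy, cx, cy)
        if uv is None:
            descs.append(None)
            continue
        col, row = int(uv[0] // patch), int(uv[1] // patch)  # nearest-patch lookup
        inside = 0 <= row < rows and 0 <= col < cols
        descs.append(list(feat_map[row][col]) if inside else None)
    return descs


def cosine(a: Desc, b: Desc) -> float:
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0.0 and nb > 0.0 else 0.0


def match_cosine(scan_desc: Sequence[Optional[Desc]], map_desc: Sequence[Desc]) -> List[Tuple[int, int]]:
    """Pair every described scan point with the map point of highest cosine similarity."""
    matches = []
    for i, d in enumerate(scan_desc):
        if d is None:
            continue  # point was not seen by any camera
        j = max(range(len(map_desc)), key=lambda k: cosine(d, map_desc[k]))
        matches.append((i, j))
    return matches
```

In a full pipeline, map points would be described the same way from the images recorded during mapping, and the resulting index pairs would serve as putative correspondences for the RANSAC stage. Bilinear interpolation of the feature grid is a natural refinement that the sketch omits.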

Abstract

LiDAR registration is a fundamental task in robotic mapping and localization. A critical component of aligning two point clouds is identifying robust point correspondences using point descriptors. This step becomes particularly challenging in scenarios involving domain shifts, seasonal changes, and variations in point cloud structures. These factors substantially impact both handcrafted and learning-based approaches. In this paper, we address these problems by proposing to use DINOv2 features, obtained from surround-view images, as point descriptors. We demonstrate that coupling these descriptors with traditional registration algorithms, such as RANSAC or ICP, facilitates robust 6DoF alignment of LiDAR scans with 3D maps, even when the map was recorded more than a year earlier. Although conceptually straightforward, our method substantially outperforms more complex baseline techniques. In contrast to previous learning-based point descriptors, our method does not require domain-specific retraining and is agnostic to the point cloud structure, effectively handling both sparse LiDAR scans and dense 3D maps. We show that leveraging the additional camera data enables our method to outperform the best baseline by +24.8 and +17.3 percentage points of registration recall on the NCLT and Oxford RobotCar datasets, respectively. We publicly release the registration benchmark and the code of our work at https://vfm-registration.cs.uni-freiburg.de.
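
The coarse-to-fine scheme referred to above, RANSAC on putative descriptor matches followed by point-to-point ICP, can be illustrated with the sketch below. It is a dependency-free illustration under stated assumptions, not the released code. The rigid fit uses Horn's closed-form quaternion method (solved with a small Jacobi eigen-solver) instead of the more common SVD-based Kabsch solution. Nearest neighbors are found by brute force rather than with a k-d tree or voxel hash. The helper names, the 3-point minimal sample, the 0.1 inlier threshold, and the iteration counts are illustrative choices.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Mat = List[List[float]]


def _matmul(A: Mat, B: Mat) -> Mat:
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]


def _top_eigvec(N: Mat, max_rot: int = 100) -> List[float]:
    """Eigenvector of the largest eigenvalue of a symmetric matrix via Jacobi rotations."""
    n = len(N)
    A = [row[:] for row in N]
    V = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(max_rot):
        p, q = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                   key=lambda ij: abs(A[ij[0]][ij[1]]))
        if abs(A[p][q]) < 1e-12:
            break
        theta = 0.5 * math.atan2(2.0 * A[p][q], A[q][q] - A[p][p])  # angle that zeroes A[p][q]
        c, s = math.cos(theta), math.sin(theta)
        J = [[float(i == j) for j in range(n)] for i in range(n)]
        J[p][p] = J[q][q] = c
        J[p][q], J[q][p] = s, -s
        A = _matmul(_matmul([list(r) for r in zip(*J)], A), J)  # A <- J^T A J
        V = _matmul(V, J)  # eigenvectors accumulate as columns of V
    k = max(range(n), key=lambda i: A[i][i])
    return [V[i][k] for i in range(n)]


def rigid_fit(src: Sequence[Point], dst: Sequence[Point]) -> Tuple[Mat, List[float]]:
    """Least-squares R, t with R @ src[i] + t ~= dst[i] (Horn's quaternion method, 1987)."""
    n = len(src)
    cs = [sum(p[k] for p in src) / n for k in range(3)]
    cd = [sum(p[k] for p in dst) / n for k in range(3)]
    S = [[sum((a[i] - cs[i]) * (b[j] - cd[j]) for a, b in zip(src, dst)) for j in range(3)]
         for i in range(3)]
    (xx, xy, xz), (yx, yy, yz), (zx, zy, zz) = S
    N = [[xx + yy + zz, yz - zy, zx - xz, xy - yx],
         [yz - zy, xx - yy - zz, xy + yx, zx + xz],
         [zx - xz, xy + yx, -xx + yy - zz, yz + zy],
         [xy - yx, zx + xz, yz + zy, -xx - yy + zz]]
    w, x, y, z = _top_eigvec(N)  # optimal unit quaternion
    R = [[1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
         [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
         [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)]]
    t = [cd[i] - sum(R[i][k] * cs[k] for k in range(3)) for i in range(3)]
    return R, t


def transform_point(R: Mat, t: Sequence[float], p: Point) -> Point:
    return tuple(sum(R[i][k] * p[k] for k in range(3)) + t[i] for i in range(3))


def ransac(src, dst, iters=200, thresh=0.1, seed=0):
    """Coarse pose from putative matches src[i] <-> dst[i], many of which may be wrong."""
    rng = random.Random(seed)
    best: List[int] = []
    for _ in range(iters):
        idx = rng.sample(range(len(src)), 3)  # minimal sample for a 6DoF pose
        R, t = rigid_fit([src[i] for i in idx], [dst[i] for i in idx])
        inliers = [i for i in range(len(src))
                   if math.dist(transform_point(R, t, src[i]), dst[i]) < thresh]
        if len(inliers) > len(best):
            best = inliers
    if len(best) < 3:
        raise ValueError("no pose hypothesis with at least 3 inliers")
    return rigid_fit([src[i] for i in best], [dst[i] for i in best]), best


def icp(src, map_pts, R, t, iters=10):
    """Point-to-point ICP: re-pair each scan point with its nearest map point, then refit."""
    for _ in range(iters):
        moved = [transform_point(R, t, p) for p in src]
        nearest = [min(map_pts, key=lambda q: math.dist(m, q)) for m in moved]
        R, t = rigid_fit(src, nearest)
    return R, t
```

Here `ransac` returns the coarse pose together with the inlier indices, and `icp` refines that pose against the full map point set. A practical version would voxelize the map and query a spatial index instead of scanning every map point.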

Paper Structure

This paper contains 13 sections, 5 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Initialization-free registration of a LiDAR scan to a large-scale 3D map requires highly expressive point descriptors. We demonstrate that DINOv2 [oquab2024dinov2] features from surround-view images enable finding robust point correspondences, even with map data recorded more than a year earlier. In the map, the registered LiDAR scan is shown in red. The colors of the LiDAR scan (top right) and the map (bottom) are obtained by applying principal component analysis (PCA) to the high-dimensional DINOv2 features.
  • Figure 2: Overview of our proposed approach for 6DoF point cloud registration. First, we extract DINOv2 [oquab2024dinov2] features from surround-view image data and attach them to the point cloud as point descriptors via point-to-pixel projection. Second, we perform a point-wise similarity search based on the cosine similarity between the descriptors of the LiDAR scan and those of the voxelized 3D map. Finally, we use a traditional coarse-to-fine registration scheme with RANSAC [fischler1987ransac] and point-to-point ICP [vizzo2023kissicp] to obtain a highly accurate pose estimate within the provided map frame.
  • Figure 3: We visualize the registration recall (RR) over a range of success thresholds, obtained by linearly interpolating and extrapolating between the thresholds used by GCL [Liu2023gcl] (left dashed line) and SpinNet [ao2021spinnet] (right dashed line). To perform scan-to-map registration, we couple RANSAC with the specified point descriptors.
  • Figure 4: To show the robustness of our proposed approach, we remove semantic entities from the 3D map. (1) The original map. (2) We identify tree-like points colored in red using the DINOv2-based descriptors. (3) We assign these points to separate clusters shown in different colors. (4) We randomly remove some clusters from the 3D map, highlighted by the red boxes.
  • Figure 5: We visualize the registration recall after ICP refinement (ICP-RR) when trees are removed from or inserted into the 3D map. Unlike the baseline methods, our proposed DINOv2-based point descriptor yields stable registration, underlining its robustness.
  • ...and 2 more figures
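
The evaluation in Figures 3 and 5 is based on registration recall: the fraction of scans whose estimated pose lies within both a rotation threshold and a translation threshold of the ground truth, evaluated over thresholds interpolated between two reference settings. The sketch below assumes the standard geodesic rotation error and Euclidean translation error with strict comparisons. The actual threshold values used by GCL and SpinNet are not stated here, so `thr_a` and `thr_b` are left as parameters, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Mat3 = Sequence[Sequence[float]]


def rotation_error_deg(R_est: Mat3, R_gt: Mat3) -> float:
    """Geodesic angle between two rotations: acos((trace(R_est^T R_gt) - 1) / 2), in degrees."""
    tr = sum(R_est[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    return math.degrees(math.acos(max(-1.0, min(1.0, (tr - 1.0) / 2.0))))


def translation_error_m(t_est: Sequence[float], t_gt: Sequence[float]) -> float:
    """Euclidean distance between estimated and ground-truth translations."""
    return math.dist(t_est, t_gt)


def registration_recall(errors: Sequence[Tuple[float, float]],
                        rot_thr_deg: float, trans_thr_m: float) -> float:
    """Fraction of (rotation error, translation error) pairs below BOTH thresholds."""
    ok = sum(1 for rot, trans in errors if rot < rot_thr_deg and trans < trans_thr_m)
    return ok / len(errors)


def recall_curve(errors: Sequence[Tuple[float, float]], thr_a: Tuple[float, float],
                 thr_b: Tuple[float, float], alphas: Sequence[float]) -> List[Tuple[float, float]]:
    """RR along thr(a) = (1 - a) * thr_a + a * thr_b; a outside [0, 1] extrapolates."""
    curve = []
    for a in alphas:
        rot_thr = (1 - a) * thr_a[0] + a * thr_b[0]
        trans_thr = (1 - a) * thr_a[1] + a * thr_b[1]
        curve.append((a, registration_recall(errors, rot_thr, trans_thr)))
    return curve
```

Computing the same recall from the poses after ICP refinement gives the ICP-RR shown in Figure 5. Extrapolating far below `a = 0` can produce negative thresholds, for which the recall is simply zero.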