LiDAR Registration with Visual Foundation Models
Niclas Vödisch, Giovanni Cioffi, Marco Cannici, Wolfram Burgard, Davide Scaramuzza
TL;DR
The paper tackles long-term LiDAR-to-map registration under environmental and domain shifts by using DINOv2 features extracted from surround-view images as point descriptors. These descriptors are attached to both LiDAR points and map points via point-to-pixel projection, enabling a cosine-similarity matching stage that feeds a global RANSAC-based coarse alignment followed by ICP refinement. The approach requires no domain-specific retraining and is agnostic to point cloud density. It outperforms the best baseline by +24.8 and +17.3 percentage points in registration recall on the NCLT and Oxford Radar RobotCar datasets, respectively, and remains robust to seasonal and long-term environmental changes. The work provides a public benchmark and code to spur further research in long-term map-based localization for mobile robots, with future directions including direct visual-to-point projections and richer semantic-geometry fusion.
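The descriptor lookup and matching stage described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the pinhole intrinsics `K`, the dense per-pixel feature map `feat_map` (for example, patch features upsampled to image resolution), the nearest-pixel lookup, and the mutual-nearest-neighbor filter are all assumptions made for illustration, and the function names are hypothetical.

```python
import numpy as np

def sample_descriptors(points_cam, K, feat_map):
    """Attach image features to 3D points by point-to-pixel projection.

    points_cam: (N, 3) points already expressed in the camera frame.
    K:          (3, 3) pinhole intrinsics (assumed camera model).
    feat_map:   (H, W, C) dense feature map (hypothetical input).
    Returns the (M, C) descriptors and an (N,) boolean validity mask.
    """
    H, W, _ = feat_map.shape
    z = points_cam[:, 2]
    in_front = z > 1e-6
    uvw = points_cam @ K.T
    # Avoid dividing by non-positive depth; those points are masked out anyway.
    safe_z = np.where(in_front, z, 1.0)
    u = np.round(uvw[:, 0] / safe_z).astype(int)
    v = np.round(uvw[:, 1] / safe_z).astype(int)
    valid = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    # Nearest-pixel lookup (an assumption; interpolation is also possible).
    return feat_map[v[valid], u[valid]], valid

def match_cosine(desc_a, desc_b, eps=1e-12):
    """Mutual nearest-neighbor matching under cosine similarity.

    Returns an (M, 2) array of index pairs (i in desc_a, j in desc_b).
    """
    a = desc_a / (np.linalg.norm(desc_a, axis=1, keepdims=True) + eps)
    b = desc_b / (np.linalg.norm(desc_b, axis=1, keepdims=True) + eps)
    sim = a @ b.T                      # (Na, Nb) cosine similarities
    nn_ab = sim.argmax(axis=1)         # best match in b for each a
    nn_ba = sim.argmax(axis=0)         # best match in a for each b
    idx_a = np.arange(len(a))
    mutual = nn_ba[nn_ab] == idx_a     # keep only pairs that agree both ways
    return np.stack([idx_a[mutual], nn_ab[mutual]], axis=1)
```

The resulting index pairs would serve as putative correspondences for the coarse alignment step. For large maps, the dense similarity matrix would likely need to be replaced by a chunked or approximate nearest-neighbor search.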
Abstract
LiDAR registration is a fundamental task in robotic mapping and localization. A critical component of aligning two point clouds is identifying robust point correspondences using point descriptors. This step becomes particularly challenging in scenarios involving domain shifts, seasonal changes, and variations in point cloud structures. These factors substantially impact both handcrafted and learning-based approaches. In this paper, we address these problems by proposing to use DINOv2 features, obtained from surround-view images, as point descriptors. We demonstrate that coupling these descriptors with traditional registration algorithms, such as RANSAC or ICP, facilitates robust 6DoF alignment of LiDAR scans with 3D maps, even when the map was recorded more than a year before. Although conceptually straightforward, our method substantially outperforms more complex baseline techniques. In contrast to previous learning-based point descriptors, our method does not require domain-specific retraining and is agnostic to the point cloud structure, effectively handling both sparse LiDAR scans and dense 3D maps. We show that leveraging the additional camera data enables our method to outperform the best baseline by +24.8 and +17.3 percentage points in registration recall on the NCLT and Oxford RobotCar datasets, respectively. We publicly release the registration benchmark and the code of our work at https://vfm-registration.cs.uni-freiburg.de.
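To make the coupling of descriptor matches with a traditional registration algorithm concrete, the following sketch estimates a rigid 6DoF transform from putative 3D correspondences using RANSAC with a Kabsch (SVD) solver. It is a generic textbook formulation rather than the paper's code; the function names, the three-point minimal sample, the iteration count, and the inlier threshold are illustrative assumptions.

```python
import numpy as np

def kabsch(src, dst):
    """Least-squares rigid transform (R, t) such that dst ≈ src @ R.T + t."""
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c_dst - R @ c_src
    return R, t

def ransac_rigid(src, dst, n_iters=500, inlier_thresh=0.5, seed=0):
    """RANSAC over matched points src[i] <-> dst[i], both of shape (N, 3).

    n_iters and inlier_thresh (same unit as the points) are assumed values.
    """
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(n_iters):
        idx = rng.choice(len(src), size=3, replace=False)
        R, t = kabsch(src[idx], dst[idx])
        residuals = np.linalg.norm(src @ R.T + t - dst, axis=1)
        inliers = residuals < inlier_thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    if best_inliers.sum() < 3:
        raise ValueError("RANSAC found fewer than 3 inliers")
    # Refit on all inliers of the best hypothesis.
    R, t = kabsch(src[best_inliers], dst[best_inliers])
    return R, t, best_inliers
```

The returned transform would typically be used as the initial guess for a local refinement such as ICP, which this sketch does not include.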
