CMRNext: Camera to LiDAR Matching in the Wild for Localization and Extrinsic Calibration

Daniele Cattaneo, Abhinav Valada

TL;DR

CMRNext tackles the challenge of cross-modal camera-LiDAR matching for monocular localization in LiDAR maps and extrinsic calibration by decoupling dense pixel-to-3D matching from pose estimation. It uses a RAFT-based network to predict per-pixel correspondences and uncertainties, then solves the PnP problem with RANSAC to recover the camera pose, enabling generalization across unseen sensor setups without retraining. The method demonstrates state-of-the-art results on multiple public datasets and three in-house platforms, with substantial gains from iterative refinement and temporal aggregation for calibration. The work provides open-source code and models, highlighting practical impact for scalable, camera-based localization in LiDAR-supported environments.
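The consensus test at the heart of a RANSAC-based PnP solver can be illustrated with a short sketch. It is not the authors' implementation: the function name `reprojection_inliers`, the 2-pixel threshold, and the toy intrinsics are assumptions made for illustration. Given a pose hypothesis, it marks as inliers the 2D-3D matches whose reprojection error falls below the threshold.

```python
import numpy as np

def reprojection_inliers(points_3d, pixels, R, t, K, threshold_px=2.0):
    """Return a boolean mask of matches consistent with the pose (R, t).

    points_3d    : (N, 3) LiDAR points in the map frame
    pixels       : (N, 2) matched pixel coordinates (u, v)
    R, t         : hypothesised map-to-camera rotation (3, 3) and translation (3,)
    K            : (3, 3) pinhole intrinsics
    threshold_px : hypothetical inlier threshold in pixels (an assumption)
    All points are assumed to lie in front of the camera (z > 0).
    """
    cam = points_3d @ R.T + t          # map frame -> camera frame
    uvw = cam @ K.T                    # apply intrinsics
    proj = uvw[:, :2] / uvw[:, 2:3]    # perspective division
    err = np.linalg.norm(proj - pixels, axis=1)
    return err < threshold_px

# Toy example: two consistent matches and one gross outlier.
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 24.0],
              [0.0, 0.0, 1.0]])
points = np.array([[0.0, 0.0, 5.0], [1.0, 0.0, 5.0], [0.0, 1.0, 5.0]])
pixels = np.array([[32.0, 24.0], [52.0, 24.0], [40.0, 60.0]])
mask = reprojection_inliers(points, pixels, np.eye(3), np.zeros(3), K)
```

A full RANSAC loop would repeatedly fit a pose to a minimal subset of matches, score it with a test like this one, and keep the hypothesis with the most inliers.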

Abstract

LiDARs are widely used for mapping and localization in dynamic environments. However, their high cost limits their widespread adoption. On the other hand, monocular localization in LiDAR maps using inexpensive cameras is a cost-effective alternative for large-scale deployment. Nevertheless, most existing approaches struggle to generalize to new sensor setups and environments, requiring retraining or fine-tuning. In this paper, we present CMRNext, a novel approach for camera-LiDAR matching that is independent of sensor-specific parameters, generalizable, and can be used in the wild for monocular localization in LiDAR maps and camera-LiDAR extrinsic calibration. CMRNext exploits recent advances in deep neural networks for matching cross-modal data and standard geometric techniques for robust pose estimation. We reformulate the point-pixel matching problem as an optical flow estimation problem and solve the Perspective-n-Point problem based on the resulting correspondences to find the relative pose between the camera and the LiDAR point cloud. We extensively evaluate CMRNext on six different robotic platforms, including three publicly available datasets and three in-house robots. Our experimental evaluations demonstrate that CMRNext outperforms existing approaches on both tasks and effectively generalizes to previously unseen environments and sensor setups in a zero-shot manner. We make the code and pre-trained models publicly available at http://cmrnext.cs.uni-freiburg.de.
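To make the optical-flow reformulation concrete, the sketch below shows one way a dense displacement field could be turned into the 2D-3D correspondences that a PnP solver consumes. It is a minimal illustration under stated assumptions, not the released code: the function names, the pinhole model without distortion, the nearest-pixel flow lookup, and the convention that the flow points from the LiDAR projection towards the matching camera pixel are all assumed here.

```python
import numpy as np

def project_points(points_cam, K):
    """Pinhole projection of (N, 3) camera-frame points to (N, 2) pixels."""
    uvw = points_cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

def flow_to_correspondences(points_map, T_init, K, flow):
    """Build 2D-3D matches from a dense flow field.

    points_map : (N, 3) LiDAR map points
    T_init     : (4, 4) rough map-to-camera transform (initial pose guess)
    K          : (3, 3) pinhole intrinsics
    flow       : (H, W, 2) displacement (du, dv) per LiDAR-image pixel,
                 assumed to point towards the matching camera pixel
    """
    h, w = flow.shape[:2]
    pts_h = np.hstack([points_map, np.ones((len(points_map), 1))])
    pts_cam = (pts_h @ T_init.T)[:, :3]

    # Keep only points in front of the camera.
    in_front = pts_cam[:, 2] > 0.1
    pts_cam, pts_kept = pts_cam[in_front], points_map[in_front]

    # Project with the initial pose and keep points inside the image.
    uv = project_points(pts_cam, K)
    cols = np.round(uv[:, 0]).astype(int)
    rows = np.round(uv[:, 1]).astype(int)
    inside = (cols >= 0) & (cols < w) & (rows >= 0) & (rows < h)
    uv, pts_kept = uv[inside], pts_kept[inside]
    rows, cols = rows[inside], cols[inside]

    # The displaced pixel is the predicted match of each 3D point.
    matched_uv = uv + flow[rows, cols]
    return matched_uv, pts_kept

# Toy example: identity initial pose and a constant flow of (+2, -1) pixels.
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 24.0],
              [0.0, 0.0, 1.0]])
points = np.array([[0.0, 0.0, 5.0],    # projects to the image centre
                   [1.0, 0.0, 5.0],    # projects inside the image
                   [0.0, 0.0, -1.0],   # behind the camera, dropped
                   [10.0, 0.0, 5.0]])  # projects outside the image, dropped
flow = np.tile(np.array([2.0, -1.0]), (48, 64, 1))
matched_uv, matched_xyz = flow_to_correspondences(points, np.eye(4), K, flow)
```

The resulting pixel and 3D-point pairs no longer depend on the initial pose guess, so they can be handed to any robust PnP solver together with the camera intrinsics.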

Paper Structure

This paper contains 29 sections, 13 equations, 10 figures, 12 tables.

Figures (10)

  • Figure 1: Our proposed CMRNext estimates the 6-DoF transformation between camera images and LiDAR scans in the wild. It can be readily employed for monocular localization in LiDAR maps and camera-LiDAR extrinsic calibration.
  • Figure 2: Overview of the proposed approach: the input camera image and LiDAR image are fed to CMRNext, which predicts dense correspondences between the two inputs. The predicted matches are used to localize the camera within the LiDAR map by solving the Perspective-n-Point problem.
  • Figure 3:
  • Figure 4: The network architecture of CMRNext is based upon RAFT (Teed and Deng, 2020). The camera image and the LiDAR image are processed by the image and LiDAR encoders, respectively, and their features are used to compute a multi-scale cost volume. The LiDAR image is additionally processed by the context encoder, whose output is used together with the cost volume to iteratively refine the optical flow using a GRU module. The output of the network is a pixel-wise camera-LiDAR displacement map and the corresponding uncertainty map. The displacement flow is color-coded following Baker et al. (2011), while the uncertainty is colorized based on the (normalized) sum of the component-wise uncertainties $\sigma_u + \sigma_v$.
  • Figure 5: Qualitative results of CMRNext on the monocular localization task. From left to right: LiDAR image projected in the initial pose, ground truth pose, and pose predicted by CMRNext. All LiDAR projections are overlaid with the respective RGB image for visualization purposes. From top to bottom: KITTI, Argoverse, Pandaset, and Freiburg-Car datasets.
  • ...and 5 more figures
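
The uncertainty colorization described in the Figure 4 caption relies on a normalized sum of the two per-component uncertainties. A minimal sketch of such a normalization follows. The min-max scheme, the function name, and the validity mask restricting the computation to pixels that contain a LiDAR projection are assumptions, since the caption does not specify them.

```python
import numpy as np

def normalized_uncertainty(sigma_u, sigma_v, valid_mask):
    """Min-max normalise sigma_u + sigma_v over valid pixels to [0, 1].

    sigma_u, sigma_v : (H, W) per-pixel uncertainties of the two flow components
    valid_mask       : (H, W) boolean mask of pixels holding a LiDAR projection
                       (hypothetical; invalid pixels are set to 0)
    """
    total = sigma_u + sigma_v
    out = np.zeros_like(total, dtype=float)
    vals = total[valid_mask]
    if vals.size == 0:
        return out
    lo, hi = vals.min(), vals.max()
    if hi > lo:
        out[valid_mask] = (vals - lo) / (hi - lo)
    return out

# Toy example on a 2x2 map with one invalid pixel.
sigma_u = np.array([[1.0, 2.0], [3.0, 0.0]])
sigma_v = np.array([[1.0, 2.0], [3.0, 0.0]])
mask = np.array([[True, True], [True, False]])
heat = normalized_uncertainty(sigma_u, sigma_v, mask)
```

The resulting map can be passed to any colormap for display, or thresholded to discard the least certain matches before pose estimation.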