CMR-Agent: Learning a Cross-Modal Agent for Iterative Image-to-Point Cloud Registration

Gongxin Yao, Yixin Xuan, Xinyang Li, Yu Pan

TL;DR

CMR-Agent reframes image-to-point cloud registration as an iterative Markov decision process to bridge the cross-modal gap between RGB images and LiDAR maps. It combines one-shot cross-modal embeddings with a 2D-3D hybrid state and a PPO-based actor-critic policy to progressively refine the camera pose in $SE(3)$. A point-to-point $D_{p2p}$ reward and imitation-learning initialization enable stable, fast training, while reusing one-shot embeddings keeps inference efficient across iterations. Empirical results on KITTI-Odometry and NuScenes show competitive accuracy and strong efficiency, with 10 iterations taking about 68 ms on an NVIDIA RTX 3090, highlighting the method’s practicality for camera localization in pre-built LiDAR maps. The work demonstrates a scalable, interpretable cross-modal registration framework with potential impact on low-cost, robust localization systems for autonomous driving.

Abstract

Image-to-point cloud registration aims to determine the relative camera pose of an RGB image with respect to a point cloud. It plays an important role in camera localization within pre-built LiDAR maps. Despite the modality gap, most learning-based methods establish 2D-3D point correspondences in feature space without any feedback mechanism for iterative optimization, resulting in poor accuracy and interpretability. In this paper, we propose to reformulate the registration procedure as an iterative Markov decision process, allowing for incremental adjustments to the camera pose based on each intermediate state. To achieve this, we employ reinforcement learning to develop a cross-modal registration agent (CMR-Agent), and use imitation learning to initialize its registration policy so that training is stable and starts quickly. Based on the cross-modal observations, we propose a 2D-3D hybrid state representation that fully exploits the fine-grained features of RGB images while reducing the useless neutral states caused by the spatial truncation of the camera frustum. Additionally, the overall framework is designed to reuse one-shot cross-modal embeddings efficiently, avoiding repetitive and time-consuming feature extraction. Extensive experiments on the KITTI-Odometry and NuScenes datasets demonstrate that CMR-Agent achieves competitive accuracy and efficiency in registration. Once the one-shot embeddings are computed, each iteration takes only a few milliseconds.
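The iterative formulation in the abstract can be illustrated with a small sketch: at each step a policy looks at the current pose estimate and outputs a relative rigid transformation, which is composed with the estimate. Everything below is an assumption for illustration only. The names `se3`, `rotvec_of`, `oracle_policy`, and `register` are hypothetical, the left-multiplication update convention is assumed, and the policy is a bounded-step oracle that peeks at the ground-truth pose, standing in for the learned CMR-Agent (which only sees cross-modal observations).

```python
import numpy as np

def se3(rotvec, t):
    """4x4 rigid transform from an axis-angle vector (Rodrigues formula) and a translation."""
    theta = np.linalg.norm(rotvec)
    K = np.zeros((3, 3))
    if theta > 1e-12:
        k = np.asarray(rotvec, dtype=float) / theta
        K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    T = np.eye(4)
    T[:3, :3] = np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)
    T[:3, 3] = t
    return T

def rotvec_of(R):
    """Axis-angle vector of a rotation matrix (valid away from 180 degrees)."""
    theta = np.arccos(np.clip((np.trace(R) - 1) / 2, -1.0, 1.0))
    if theta < 1e-8:
        return np.zeros(3)
    w = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return w * theta / (2 * np.sin(theta))

def clip_norm(v, limit):
    """Scale v down so that its Euclidean norm does not exceed limit."""
    n = np.linalg.norm(v)
    return v if n <= limit else v * (limit / n)

def oracle_policy(T_est, T_gt, max_angle=0.2, max_step=1.0):
    """Hypothetical stand-in for the agent: a bounded step toward the true pose."""
    residual = T_gt @ np.linalg.inv(T_est)
    return se3(clip_norm(rotvec_of(residual[:3, :3]), max_angle),
               clip_norm(residual[:3, 3], max_step))

def register(T_init, policy, iterations=10):
    """Iteratively refine the pose: each action is composed with the current estimate."""
    T = T_init.copy()
    for _ in range(iterations):
        T = policy(T) @ T  # assumed update convention: left-multiply the action
    return T

def pose_error(T_est, T_gt):
    """Rotation error (radians) and translation error of an estimate."""
    residual = T_gt @ np.linalg.inv(T_est)
    return np.linalg.norm(rotvec_of(residual[:3, :3])), np.linalg.norm(residual[:3, 3])

# Toy example: perturb a ground-truth pose, then refine it in 10 bounded steps.
T_gt = se3([0.0, 0.3, 0.0], [1.0, 0.0, 2.0])
T_init = se3([0.3, -0.2, 0.3], [2.0, -1.0, 0.5]) @ T_gt
T_final = register(T_init, lambda T: oracle_policy(T, T_gt))
print("initial error:", pose_error(T_init, T_gt))
print("final error:  ", pose_error(T_final, T_gt))
```

Bounding each action keeps every step small and interpretable, which mirrors the "incremental adjustments" described above. A learned policy would replace `oracle_policy` and would not have access to `T_gt`.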

Paper Structure (17 sections, 15 equations, 7 figures, 4 tables)

Figures (7)

  • Figure 1: Illustrations of image-to-point cloud registration. (a) A lightweight system equipped solely with a camera can localize itself by registering the online RGB image and the pre-built LiDAR map. (b) Our CMR-Agent takes the cross-modal observations as inputs, predicting actions (i.e., relative rigid transformations) to iteratively improve registration. The red and green frustums represent the estimated and actual camera poses, respectively.
  • Figure 2: The overall framework of our cross-modal registration agent (CMR-Agent). Parts 2 to 4 on the right constitute the iterative body. The dashed lines indicate the data flow for reusing one-shot embeddings. The bottom right corner is a schematic overview of the entire iterative process.
  • Figure 3: Illustrations of the useful states (top) and neutral states (bottom). The top row shows a fine case where the camera frustums of $\overline{\textbf{T}}$ and $\tilde{\textbf{T}}^k$ overlap with each other. After projection, the geometric features in the depth map $\textbf{D}^{k}$ can indicate actions to adjust the camera pose to align with the RGB image. In the bottom row, $\textbf{D}^{k}$ does not contain any geometric information of the RGB image, which is useless for action prediction.
  • Figure 4: Illustration of the point-to-point distance between the RGB image and the point cloud.
  • Figure 5: Error distributions on the two datasets. (a) Sequence 09 of KITTI-Odometry. (b) Sequence 10 of KITTI-Odometry. (c) NuScenes.
  • ...and 2 more figures
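
The useful versus neutral states in Figure 3 can be made concrete with a minimal projection sketch. It renders a sparse depth map of a point cloud from a candidate camera pose and flags the state as neutral when no point lands inside the image. The function names `project_depth` and `is_neutral`, the pinhole intrinsics, the near-plane threshold, and the world-to-camera pose convention are all assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def project_depth(points, T_cam_from_world, K, height, width, near=0.1):
    """Render a sparse depth map of `points` (N, 3) for a 4x4 world-to-camera pose.

    Pixels that receive no point are 0. If several points hit one pixel,
    the nearest depth is kept (a simple z-buffer).
    """
    pts_h = np.hstack([points, np.ones((len(points), 1))])
    cam = (T_cam_from_world @ pts_h.T).T[:, :3]
    cam = cam[cam[:, 2] > near]              # drop points behind the camera
    depth = np.full((height, width), np.inf)
    if len(cam) > 0:
        uvw = (K @ cam.T).T
        u = np.floor(uvw[:, 0] / uvw[:, 2]).astype(int)
        v = np.floor(uvw[:, 1] / uvw[:, 2]).astype(int)
        inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
        # Unbuffered minimum so repeated pixel indices keep the closest point.
        np.minimum.at(depth, (v[inside], u[inside]), cam[inside, 2])
    depth[np.isinf(depth)] = 0.0
    return depth

def is_neutral(depth):
    """A state is 'neutral' here if the depth map carries no geometry at all."""
    return not depth.any()

# Toy scene: a small planar patch 5 m in front of the world origin,
# plus one farther point on the optical axis to exercise the z-buffer.
xs, ys = np.meshgrid(np.linspace(-1, 1, 9), np.linspace(-0.5, 0.5, 5))
points = np.column_stack([xs.ravel(), ys.ravel(), np.full(xs.size, 5.0)])
points = np.vstack([points, [0.0, 0.0, 8.0]])

K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 24.0], [0.0, 0.0, 1.0]])
facing = np.eye(4)                           # camera looks at the patch
turned = np.diag([-1.0, 1.0, -1.0, 1.0])     # camera rotated 180 degrees about y

depth_useful = project_depth(points, facing, K, 48, 64)
depth_neutral = project_depth(points, turned, K, 48, 64)
print(is_neutral(depth_useful), is_neutral(depth_neutral))
```

In the turned pose every point lies behind the camera, so the depth map is empty and offers no cue for choosing an action, which is the situation the hybrid 2D-3D state is meant to reduce.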