MaFreeI2P: A Matching-Free Image-to-Point Cloud Registration Paradigm with Active Camera Pose Retrieval

Gongxin Yao, Xinyang Li, Yixin Xuan, Yu Pan

TL;DR

The paper tackles image-to-point cloud registration when modality gaps make 2D-3D matching brittle. It introduces MaFreeI2P, a matching-free approach that actively retrieves the camera pose in $SE(3)$ by sampling poses, constructing pose-based cost volumes from cross-modal embeddings, and guiding pose updates with a learned similarity function. Key innovations include a cross-modal pseudo-siamese backbone with circle loss, a pose-based cost-volume formulation, a confidence-weighted similarity estimator, and an iterative refinement loop with shrinking search spaces. Empirical results show state-of-the-art relative translation error and high recall on KITTI-Odometry, with competitive performance on Apollo-DaoxiangLake, demonstrating robustness and practical impact for cross-modal localization and mapping.

Abstract

Image-to-point cloud registration seeks to estimate the relative camera pose between an image and a point cloud, which remains an open problem due to the gap between the two data modalities. Recent matching-based methods tackle this by building 2D-3D correspondences. In this paper, we reveal the information loss inherent in these methods and propose a matching-free paradigm, named MaFreeI2P. Our key insight is to actively retrieve the camera pose in SE(3) space by contrasting the geometric features between the point cloud and the query image. To achieve this, we first sample a set of candidate camera poses and construct their cost volume using the cross-modal features. Compared with matching, the cost volume preserves more information, and its feature similarity implicitly reflects the confidence level of the sampled poses. Afterwards, we employ a convolutional network to adaptively formulate a similarity assessment function, where the input cost volume is further improved by filtering and pose-based weighting. Finally, we update the camera pose based on the similarity scores, and adopt a heuristic strategy to iteratively shrink the pose sampling space for convergence. Our MaFreeI2P achieves very competitive registration accuracy and recall on the KITTI-Odometry and Apollo-DaoxiangLake datasets.
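The sample, score, update, and shrink loop described in the abstract can be sketched as follows. This is a minimal illustration under loud assumptions, not the authors' implementation: poses are reduced to a 3-vector (x, y, yaw) instead of full SE(3), `toy_similarity` is a hypothetical stand-in for the learned similarity network applied to the cost volume, and the sample count, iteration count, and shrink factor are arbitrary choices.

```python
import numpy as np

def sample_candidates(center, half_range, n, rng):
    """Uniformly sample n candidate poses in a box around `center`.

    The current estimate is kept as a candidate so the best score never drops.
    """
    offsets = rng.uniform(-1.0, 1.0, size=(n, center.size)) * half_range
    return np.vstack([center, center + offsets])

def retrieve_pose(score_fn, init, half_range, n=512, iters=10, shrink=0.6, seed=0):
    """Iteratively pick the best-scoring candidate and shrink the search box."""
    rng = np.random.default_rng(seed)
    pose = np.asarray(init, dtype=float)
    half_range = np.asarray(half_range, dtype=float)
    for _ in range(iters):
        candidates = sample_candidates(pose, half_range, n, rng)
        scores = np.array([score_fn(c) for c in candidates])
        pose = candidates[int(np.argmax(scores))]  # pose with highest similarity
        half_range = half_range * shrink           # heuristic shrinking of the sampling space
    return pose

# Toy setup (assumption): similarity peaks at a known ground-truth pose.
gt = np.array([1.0, -2.0, 0.3])     # x [m], y [m], yaw [rad]
scale = np.array([5.0, 5.0, 1.0])   # initial half-ranges of the search box

def toy_similarity(pose):
    # Stand-in for the learned similarity score; higher is better.
    return -np.sum(((pose - gt) / scale) ** 2)

estimate = retrieve_pose(toy_similarity, init=np.zeros(3), half_range=scale)
print(estimate)
```

Keeping the current estimate among the candidates makes the best score non-decreasing across iterations, which is one simple way to keep a shrinking search box from discarding a good pose.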

Paper Structure

This paper contains 12 sections, 8 equations, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: (a) Illustration of image-to-point cloud registration, where the features are extracted by a pseudo-siamese neural network. (b) The mainstream matching-based pipeline. (c) Our matching-free paradigm compares the 2D features with the 3D features viewed from various camera poses, iteratively updating the output to the pose with the highest similarity.
  • Figure 2: The matching-based pipelines only retain the information of 2D-3D matches even if some of them are wrong as shown in (a). Cost volume builds virtual correspondences between 3D points and their projected 2D pixels. Each cost volume unit projects 3D points using a distinct camera pose, as shown in (b) and (c). Consequently, the information between each 3D point and all 2D pixels within its total projection range can be preserved in the whole cost volume, as shown in (d). Besides, projection partially maintains the spatial proximity of 3D points, whereas the wrong matches destroy it.
  • Figure 3: The architecture of MaFreeI2P. Top: The backbone and classification heads are supervised by the circle loss in Eq. \ref{eq2} and the focal loss in Eq. \ref{eq5}, respectively. Bottom: The iterative branch in Sections \ref{sec:sampling}$\sim$\ref{sec:similarity}. $\mathcal{F}_{s}(\cdot)$ is a lightweight network supervised by the cross-entropy loss in Eq. \ref{eq8}.
  • Figure 4: Three visualization examples of image-to-point cloud registration. Green outlines highlight the position of objects on the 2D image. Red outlines highlight the projection of 3D objects in the point cloud, which is transformed by the estimated camera poses. The degree of alignment between the green and red outlines reflects the registration accuracy.
  • Figure 5: Visualization of MaFreeI2P iterations. The ground-truth (green) and estimated (red) camera poses are visualized in the bird's-eye view of the scene point cloud (gray).
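
The pose-based cost volume described in the Figure 2 caption, where each unit projects the 3D points with a distinct camera pose and pairs them with the pixels they land on, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the function names, the nearest-pixel lookup, the dot-product similarity on L2-normalized features, and the plain sum used to rank poses (the paper uses a learned network for that step).

```python
import numpy as np

def project(points, R, t, K):
    """Project Nx3 points into pixel coordinates for camera pose (R, t)."""
    cam = points @ R.T + t                 # transform into the camera frame
    z = cam[:, 2]
    valid = z > 1e-6                       # keep points in front of the camera
    uvw = cam @ K.T
    uv = uvw[:, :2] / np.where(valid, z, 1.0)[:, None]
    return uv, valid

def cost_volume(points, point_feats, image_feats, K, poses):
    """One row per candidate pose: similarity of each 3D point to its projected pixel."""
    H, W, _ = image_feats.shape
    volume = np.zeros((len(poses), len(points)))
    for i, (R, t) in enumerate(poses):
        uv, valid = project(points, R, t, K)
        u = np.rint(uv[:, 0]).astype(int)
        v = np.rint(uv[:, 1]).astype(int)
        inside = valid & (u >= 0) & (u < W) & (v >= 0) & (v < H)
        pix = image_feats[v[inside], u[inside]]
        # Dot product of unit-length features, i.e. cosine similarity.
        volume[i, inside] = np.sum(point_feats[inside] * pix, axis=1)
    return volume

# Synthetic scene (assumption): image features are painted from the point
# features under the identity pose, so that pose should score highest.
rng = np.random.default_rng(0)
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 24.0], [0.0, 0.0, 1.0]])
points = rng.uniform([-1.0, -0.7, 4.0], [1.0, 0.7, 8.0], size=(50, 3))
point_feats = rng.normal(size=(50, 8))
point_feats /= np.linalg.norm(point_feats, axis=1, keepdims=True)

image_feats = np.zeros((48, 64, 8))
uv, _ = project(points, np.eye(3), np.zeros(3), K)
px = np.rint(uv).astype(int)
image_feats[px[:, 1], px[:, 0]] = point_feats

poses = [
    (np.eye(3), np.zeros(3)),                 # true pose
    (np.eye(3), np.array([0.5, 0.0, 0.0])),   # shifted along x
    (np.eye(3), np.array([0.0, 0.5, 0.0])),   # shifted along y
]
volume = cost_volume(points, point_feats, image_feats, K, poses)
best = int(np.argmax(volume.sum(axis=1)))     # simple stand-in for the learned scorer
```

Because every point contributes a similarity value under every candidate pose, the volume retains information that a hard 2D-3D match would discard, which is the contrast the caption draws with matching-based pipelines.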