Quantity-Aware Coarse-to-Fine Correspondence for Image-to-Point Cloud Registration

Gongxin Yao, Yixin Xuan, Yiwei Chen, Yu Pan

TL;DR

The paper addresses image-to-point cloud registration by proposing a quantity-aware coarse-to-fine framework (CFI2P) that learns soft set-to-patch correlations and refines them to point-to-pixel correspondences. It models cross-modal correlation as an optimal transport problem with continuous supervision based on bilateral point-proportions, and uses a hybrid transformer architecture with a confidence-sorting mechanism to progressively improve correspondences. Coarse matching establishes initial set-to-patch mappings, which are then refined through resampling, attention-based learning, and masked optimal transport at the fine level, culminating in efficient RANSAC-based PnP pose estimation. Empirical results on KITTI Odometry and NuScenes show state-of-the-art performance with high inlier ratios, robust to density and resolution gaps, underscoring its practical value for multi-modal perception in robotics and autonomous systems.

Abstract

Image-to-point cloud registration aims to determine the relative camera pose between an RGB image and a reference point cloud, serving as a general solution for locating 3D objects from 2D observations. Matching individual points with pixels can be inherently ambiguous due to modality gaps. To address this challenge, we propose a framework to capture quantity-aware correspondences between local point sets and pixel patches and refine the results at both the point and pixel levels. This framework aligns the high-level semantics of point sets and pixel patches to improve the matching accuracy. On a coarse scale, the set-to-patch correspondence is expected to be influenced by the quantity of 3D points. To achieve this, a novel supervision strategy is proposed to adaptively quantify the degrees of correlation as continuous values. On a finer scale, point-to-pixel correspondences are refined from a smaller search space through a well-designed scheme, which incorporates both resampling and quantity-aware priors. Particularly, a confidence sorting strategy is proposed to proportionally select better correspondences at the final stage. Leveraging the advantages of high-quality correspondences, the problem is successfully resolved using an efficient Perspective-n-Point solver within the framework of random sample consensus (RANSAC). Extensive experiments on the KITTI Odometry and NuScenes datasets demonstrate the superiority of our method over the state-of-the-art methods.
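The contrast between a binary (match-or-not) label and a quantity-aware label can be illustrated with a small NumPy sketch. This is not the authors' supervision formula: the function name `point_proportions`, the inputs `set_ids` and `patch_ids`, and the choice to report the two normalised proportions separately are all assumptions made for illustration. It only shows how counting projected points per (set, patch) pair yields continuous values from both sides of the pairing.

```python
import numpy as np

def point_proportions(set_ids, patch_ids, num_sets, num_patches):
    """Count projected points per (set, patch) pair and normalise both ways.

    set_ids[k]   : index of the point set that point k belongs to (assumed input)
    patch_ids[k] : index of the pixel patch that point k projects into (assumed input)
    """
    counts = np.zeros((num_sets, num_patches), dtype=np.float64)
    # Accumulate one count per point; np.add.at handles repeated index pairs.
    np.add.at(counts, (set_ids, patch_ids), 1.0)

    per_set_total = counts.sum(axis=1, keepdims=True)
    per_patch_total = counts.sum(axis=0, keepdims=True)

    # Fraction of each set's points that land in a given patch.
    share_of_set = np.divide(counts, per_set_total,
                             out=np.zeros_like(counts), where=per_set_total > 0)
    # Fraction of each patch's points that come from a given set.
    share_of_patch = np.divide(counts, per_patch_total,
                               out=np.zeros_like(counts), where=per_patch_total > 0)
    return counts, share_of_set, share_of_patch

# Toy data: set 0 has three points (two in patch 0, one in patch 1),
# set 1 has one point in patch 1.
set_ids = np.array([0, 0, 0, 1])
patch_ids = np.array([0, 0, 1, 1])
counts, share_of_set, share_of_patch = point_proportions(set_ids, patch_ids, 2, 2)

# A hard label only records whether any point falls into the patch.
binary = (counts > 0).astype(np.float64)
```

In this toy case the binary label treats the pairs (set 0, patch 0) and (set 0, patch 1) identically, whereas the proportions distinguish them, which is the extra information a soft label carries.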

Paper Structure (24 sections, 23 equations, 10 figures, 5 tables)

Figures (10)

  • Figure 1: Illustration of the coarse-to-fine mechanism and the set-to-patch correlation. (a) Directly matching individual points with pixels is fraught with ambiguity due to the disparate visual attributes. (b) The global ambiguity between similar points and pixels can be resolved through the high-level semantics represented by local point sets and pixel patches. To supervise the matching between the sets and patches, the traditional binary correlation in (c) can be seen as a hard label (i.e., match-or-not), while our quantity-aware correlation in (d) can be seen as a soft label with richer information.
  • Figure 2: Toy examples illustrating the challenges. The solid black lines mark the boundaries of pixel patches, and the colors denote different point sets. (a) and (b) show cases with dense and sparse points, respectively; in both, points from different sets are projected into the same pixel patch. (c) A case of GeoLoc. The brighter points are the set centers. The pixel patch containing a center point is defined as the sole patch correlated with the corresponding set. (d) A case of 2D3D-MATR. Pixels (black) within a distance of $\boldsymbol{\tau}$ of the projected points (blue) are overlapping pixels. If the point-pixel overlap ratio between a set and a patch is greater than $\boldsymbol{\tau}_{o}$, the correlation is set to 1, otherwise 0. This criterion is usually biased by the resolution gap.
  • Figure 3: Overview of the proposed CFI2P framework. 1) The image and point cloud are divided into non-overlapping regions to extract the local proxies of point sets and pixel patches. 2) Hybrid Transformers and cross-attention capture global and cross-modal contexts between the proxies. 3) A differentiable optimal transport algorithm matches the proxies. 4) After fusing the rich contexts into the point and pixel levels, we sample n points within the set of each candidate point proxy and select its top k pixel proxies. Binary sampling masks are produced to guide the subsequent learning. 5) The point-to-pixel correspondences are established by the fine-level optimal transport algorithm and the confidence sorting strategy.
  • Figure 4: Target of our confidence sorting strategy. Assuming that the point set $G^{\mathbf{P}}_j$ and the pixel patch $\mathbf{I}_i$ are coarsely matched, we tend to select the points in $G^{\mathbf{P}}_j$ that can be projected into $\mathbf{I}_i$.
  • Figure 5: Left: The curves of IR across different thresholds. Right: The curves of FMR ($\tau_d$ for IR is 3 pixels) across different thresholds. We obtained the quantitative results on the KITTI Odometry dataset.
  • ...and 5 more figures
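
Steps 3 and 4 of the Figure 3 caption (matching proxies by optimal transport, then keeping the top k pixel proxies per point proxy) can be sketched with a generic entropic Sinkhorn iteration. This is a minimal illustration under assumptions, not the paper's formulation: uniform marginals, the regularisation value `eps`, the iteration count, and the helper names `sinkhorn` and `top_k_patches` are all hypothetical, and the sampling masks and any extra slack terms the authors may use are omitted.

```python
import numpy as np

def logsumexp(x, axis):
    """Numerically stable log-sum-exp that keeps the reduced axis."""
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def sinkhorn(scores, num_iters=50, eps=0.1):
    """Entropic optimal transport in the log domain with uniform marginals.

    scores[j, i] : similarity between point proxy j and pixel proxy i (assumed input)
    Returns a transport plan whose entries sum to 1.
    """
    n, m = scores.shape
    log_k = scores / eps
    log_mu = np.full((n, 1), -np.log(n))  # uniform mass over point proxies (assumption)
    log_nu = np.full((1, m), -np.log(m))  # uniform mass over pixel proxies (assumption)
    u = np.zeros((n, 1))
    v = np.zeros((1, m))
    for _ in range(num_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = log_mu - logsumexp(log_k + v, axis=1)
        v = log_nu - logsumexp(log_k + u, axis=0)
    return np.exp(log_k + u + v)

def top_k_patches(plan, k):
    """Indices of the k pixel proxies with the largest plan mass for each point proxy."""
    return np.argsort(-plan, axis=1)[:, :k]

# Toy score matrix: each point proxy is most similar to the pixel proxy of the same index.
scores = np.eye(3)
plan = sinkhorn(scores)
best = top_k_patches(plan, 1)
```

Working in the log domain keeps the iteration stable when `eps` is small; the resulting plan is a soft assignment, so the top-k step narrows the search space for the fine level without committing to a single patch.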