FreeReg: Image-to-Point Cloud Registration Leveraging Pretrained Diffusion Models and Monocular Depth Estimators

Haiping Wang, Yuan Liu, Bing Wang, Yujing Sun, Zhen Dong, Wenping Wang, Bisheng Yang

TL;DR

FreeReg addresses cross-modality image-to-point cloud (I2P) registration without task-specific training by unifying modalities through pretrained diffusion models and a monocular depth estimator. It extracts diffusion features from RGB images and depth maps via Stable Diffusion and ControlNet, and augments them with geometric features from Zoe-Depth and FCGF to produce robust, dense correspondences. Pixel-to-point matches are obtained with nearest-neighbor mutual checks and SE(3) pose is recovered with the Kabsch algorithm (or PnP when depth scaling is uncertain). Without task-specific training, FreeReg achieves substantial gains on indoor and outdoor benchmarks, highlighting strong generalization and indicating future speedups and automatic feature selection as promising directions.
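The matching and pose-recovery steps mentioned above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation. It assumes that per-pixel and per-point descriptors are already available as row vectors and that matched pixels have been back-projected to 3D with the estimated depth, so that a 3D-to-3D Kabsch solve applies. The function names `mutual_nearest_neighbors` and `kabsch` are hypothetical, and cosine similarity is an assumed choice of matching score.

```python
import numpy as np


def mutual_nearest_neighbors(feat_a, feat_b):
    """Return index pairs (i, j) where a_i and b_j are each other's nearest neighbor."""
    # Cosine similarity on L2-normalized descriptors (an assumed choice of score).
    a = feat_a / np.linalg.norm(feat_a, axis=1, keepdims=True)
    b = feat_b / np.linalg.norm(feat_b, axis=1, keepdims=True)
    sim = a @ b.T
    nn_ab = sim.argmax(axis=1)  # best match in b for every row of a
    nn_ba = sim.argmax(axis=0)  # best match in a for every row of b
    idx_a = np.arange(len(a))
    keep = nn_ba[nn_ab] == idx_a  # mutual check: the match must agree in both directions
    return np.stack([idx_a[keep], nn_ab[keep]], axis=1)


def kabsch(src, dst):
    """Least-squares rigid transform (R, t) such that dst ~= src @ R.T + t."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)  # 3x3 cross-covariance of centered points
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t
```

In practice the correspondences contain outliers, so a solver like this would typically run inside a robust estimation loop such as RANSAC; that wrapper is omitted here to keep the sketch short.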

Abstract

Matching cross-modality features between images and point clouds is a fundamental problem for image-to-point cloud registration. However, due to the modality difference between images and points, it is difficult to learn robust and discriminative cross-modality features with existing metric learning methods for feature matching. Instead of applying metric learning on cross-modality data, we propose to first unify the modality between images and point clouds using pretrained large-scale models, and then establish robust correspondences within the same modality. We show that the intermediate features, called diffusion features, extracted by depth-to-image diffusion models are semantically consistent between images and point clouds, which enables the building of coarse but robust cross-modality correspondences. We further extract geometric features on depth maps produced by the monocular depth estimator. By matching such geometric features, we significantly improve the accuracy of the coarse correspondences produced by diffusion features. Extensive experiments demonstrate that, without any task-specific training, direct utilization of both features produces accurate image-to-point cloud registration. On three public indoor and outdoor benchmarks, the proposed method achieves, on average, a 20.6 percent improvement in Inlier Ratio, a three-fold higher Inlier Number, and a 48.6 percent improvement in Registration Recall over existing state-of-the-art methods.

Paper Structure (37 sections, 4 equations, 10 figures, 13 tables)

This paper contains 37 sections, 4 equations, 10 figures, 13 tables.

Figures (10)

  • Figure 1: Left: FreeReg unifies the modalities of images and point clouds, which enables mono-modality matching to build cross-modality correspondences. Right: FreeReg does not require any training on the I2P task and is able to register RGB images to point clouds in both indoor and outdoor scenes, even for challenging cases with small overlaps, large viewpoint changes, and sparse point density.
  • Figure 2: To unify the modalities of point clouds (PCs) and images, I: a straightforward way is to generate RGB images from point clouds by depth-to-image diffusion models. However, the generated images usually have large appearance differences from the query images. II: We find that the intermediate features of diffusion models show strong semantic consistency between RGB images and depth maps, resulting in sparse but robust correspondences. III: We further convert RGB images to point clouds by a monocular depth estimator and extract geometric features to match between the input and the generated point clouds, yielding dense but noisy correspondences. IV: We propose to fuse both types of features to build dense and accurate correspondences.
  • Figure 3: FreeReg pipeline. Given a point cloud (PC) and a partially overlapping RGB image, FreeReg extracts diffusion features and geometric features for the point cloud and the image. These two features are fused and matched to establish pixel-to-point correspondences, on which we compute the SE(3) relative pose between the image and the point cloud.
  • Figure 4: Diffusion feature extraction on (a) images and (b) depth maps. (c) Visualization of diffusion features.
  • Figure 5: Visualization of features and estimated correspondences. (a) Input images and point clouds. (b), (c), and (d) show the visualization of diffusion, geometric, and fused feature maps respectively. (e), (f), and (g) show the pixel-to-point correspondences estimated by the nearest neighbor (NN) matcher using diffusion, geometric, and fused features respectively. Diffusion features estimate reliable but sparse correspondences. Geometric features yield dense matches but with more outliers. Fused features strike a balance between accuracy and preserving fine-grained details, resulting in accurate and dense matches.
  • ...and 5 more figures
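
The captions for Figures 2, 3, and 5 describe fusing diffusion and geometric features before matching, but the text here does not state the fusion rule. One plausible reading is sketched below: a weighted concatenation of L2-normalized descriptors. The rule itself, the function name `fuse_features`, and the weight `w` are assumptions made for illustration and should not be taken as the authors' method.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length, guarding against zero vectors."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def fuse_features(diffusion_feat, geometric_feat, w=0.5):
    """Weighted concatenation of two per-pixel (or per-point) descriptors.

    `w` is a hypothetical fusion weight in [0, 1]. Square-root weights make
    the dot product of two fused descriptors equal to
    w * cos(diffusion) + (1 - w) * cos(geometric).
    """
    d = l2_normalize(diffusion_feat)
    g = l2_normalize(geometric_feat)
    return np.concatenate([np.sqrt(w) * d, np.sqrt(1.0 - w) * g], axis=1)
```

Under this assumption each fused descriptor has unit norm, so a nearest-neighbor matcher on the fused vectors ranks candidates by a blend of the two similarities: the diffusion term contributes semantic robustness and the geometric term contributes density and fine detail.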