ZeroReg: Zero-Shot Point Cloud Registration with Foundation Models
Weijie Wang, Wenqi Ren, Guofeng Mei, Bin Ren, Xiaoshui Huang, Fabio Poiesi, Nicu Sebe, Bruno Lepri
TL;DR
ZeroReg tackles zero-shot point cloud registration (PCR) by using 2D foundation-model semantics to locate and match objects across views, and by building scene graphs to resolve semantic ambiguities, all without requiring 3D training data. The method detects and segments objects with Florence-2 and SAMv2, extracts CLIP-based semantic features, and back-projects them into 3D space. Object-level correspondences are found by graph matching on the scene graphs, and point-level correspondences are refined within matched regions using SuperGlue/LoFTR and RANSAC. The approach achieves competitive performance on 3DMatch, 3DLoMatch, and ScanNet, indicating strong generalization in data-scarce scenarios and reduced reliance on 3D annotations. The authors acknowledge a modality gap between 2D foundation-model pretraining and 3D scenes and identify closing it as a key direction for improving zero-shot PCR.
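To make the back-projection step concrete, the following is a minimal sketch of lifting segmented pixels into 3D. It assumes a pinhole camera with known intrinsics and a per-pixel metric depth map (an RGB-D setting); the function names `back_project` and `back_project_mask` and the parameters `fx`, `fy`, `cx`, `cy` are illustrative and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def back_project(u: float, v: float, depth: float,
                 fx: float, fy: float, cx: float, cy: float) -> Point3D:
    """Lift pixel (u, v) with metric depth to a 3D point in the camera frame.

    Assumes a pinhole model: fx, fy are focal lengths in pixels and
    (cx, cy) is the principal point.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def back_project_mask(pixels: Sequence[Tuple[int, int]],
                      depth_map: Sequence[Sequence[float]],
                      fx: float, fy: float, cx: float, cy: float) -> List[Point3D]:
    """Back-project every masked pixel (u, v) that has a valid, positive depth."""
    points = []
    for u, v in pixels:
        d = depth_map[v][u]  # depth maps are indexed row (v) first, then column (u)
        if d > 0:            # skip pixels with missing depth
            points.append(back_project(u, v, d, fx, fy, cx, cy))
    return points
```

In such a setup, any semantic feature attached to a pixel or object mask can be carried along with the resulting 3D points, which is one plausible way to obtain object-level features in 3D.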
Abstract
State-of-the-art 3D point cloud registration methods rely on labeled 3D datasets for training, which limits their practical applications in real-world scenarios and often hinders generalization to unseen scenes. Leveraging the zero-shot capabilities of foundation models offers a promising solution to these challenges. In this paper, we introduce ZeroReg, a zero-shot registration approach that utilizes 2D foundation models to predict 3D correspondences. Specifically, ZeroReg adopts an object-to-point matching strategy, starting with object localization and semantic feature extraction from multi-view images using foundation models. In the object matching stage, semantic features help identify correspondences between objects across views. However, relying solely on semantic features can lead to ambiguity, especially in scenes with multiple instances of the same category. To address this, we construct scene graphs to capture spatial relationships among objects and apply a graph matching algorithm to these graphs to accurately identify matched objects. Finally, we compute fine-grained point-level correspondences within matched object regions using algorithms such as SuperGlue and LoFTR to achieve robust point cloud registration. Evaluations on benchmarks such as 3DMatch, 3DLoMatch, and ScanNet demonstrate ZeroReg's competitive performance, highlighting its potential to advance point cloud registration by integrating semantic features from foundation models.
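The ambiguity described above, and how spatial relationships can resolve it, can be illustrated with a toy example. The sketch below is not the paper's graph matching algorithm: it assumes each object is a (semantic feature, 3D centroid) pair, scores an assignment by node similarity (cosine of features) minus a penalty on edge inconsistency (difference in pairwise centroid distances), and searches all assignments by brute force, which is only feasible for a handful of objects. The names `match_objects` and `weight` are hypothetical.

```python
from itertools import permutations
from math import dist


def cosine(a, b):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)


def match_objects(src, tgt, weight=1.0):
    """Match each source object to a distinct target object.

    src, tgt: lists of (feature, centroid) pairs; assumes len(src) <= len(tgt).
    Returns a list m where m[i] is the target index matched to src[i].
    """
    n = len(src)
    best_score, best_perm = float("-inf"), None
    for perm in permutations(range(len(tgt)), n):
        # Node term: semantic agreement between matched objects.
        node = sum(cosine(src[i][0], tgt[perm[i]][0]) for i in range(n))
        # Edge term: pairwise distances are preserved by rigid motion,
        # so a correct assignment should leave them nearly unchanged.
        edge = 0.0
        for i in range(n):
            for j in range(i + 1, n):
                d_src = dist(src[i][1], src[j][1])
                d_tgt = dist(tgt[perm[i]][1], tgt[perm[j]][1])
                edge += abs(d_src - d_tgt)
        score = node - weight * edge
        if score > best_score:
            best_score, best_perm = score, perm
    return list(best_perm)


# Two chairs share the same semantic feature, so features alone cannot
# tell them apart; their distances to the table break the tie.
chair, table = [1.0, 0.0], [0.0, 1.0]
view_a = [(chair, (0, 0, 0)), (chair, (3, 0, 0)), (table, (1, 0, 0))]
# Same scene after a rigid motion, with the objects listed in another order.
view_b = [(table, (5, 6, 0)), (chair, (5, 8, 0)), (chair, (5, 5, 0))]
matches = match_objects(view_a, view_b)
print(matches)  # [2, 1, 0]
```

Swapping the two chairs leaves the semantic score unchanged but makes the chair-to-table distances inconsistent, so the edge term rejects that assignment. A practical system would replace the exhaustive search with a scalable graph matching solver.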
