ZeroReg: Zero-Shot Point Cloud Registration with Foundation Models

Weijie Wang; Wenqi Ren; Guofeng Mei; Bin Ren; Xiaoshui Huang; Fabio Poiesi; Nicu Sebe; Bruno Lepri

TL;DR

ZeroReg addresses zero-shot point cloud registration (PCR) by using 2D foundation-model semantics to locate and match objects across views, and by building scene graphs to resolve semantic ambiguities, all without 3D training data. The method detects and segments objects with Florence-2 and SAMv2, extracts CLIP-based semantic features, and back-projects them to 3D space. Object-level correspondences are found via graph matching on scene graphs, while point-level correspondences are refined within matched regions using SuperGlue/LoFTR and RANSAC. The approach demonstrates competitive performance on 3DMatch, 3DLoMatch, and ScanNet, illustrating strong generalization in data-scarce scenarios and reduced reliance on 3D annotations. The work acknowledges a modality gap between 2D foundation-model pretraining and 3D scenes and identifies closing it as a key direction for improving zero-shot PCR.
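
The back-projection of image-space object regions to 3D mentioned in the summary can be illustrated with a standard pinhole camera model. This is a minimal sketch under stated assumptions, not the authors' code: the function name `back_project_mask`, the intrinsics matrix `K`, and the availability of a metric depth map aligned with the image are all hypothetical choices made for illustration.

```python
import numpy as np

def back_project_mask(depth, mask, K):
    """Lift masked pixels to 3D camera coordinates with a pinhole model.

    depth: (H, W) depth map in metres (assumed aligned with the image)
    mask:  (H, W) boolean object mask
    K:     (3, 3) camera intrinsics
    All names here are illustrative, not taken from the paper.
    """
    # Keep only masked pixels that have a valid (positive) depth reading.
    v, u = np.nonzero(mask & (depth > 0))   # v = pixel row, u = pixel column
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]         # (u - cx) * z / fx
    y = (v - K[1, 2]) * z / K[1, 1]         # (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)      # (N, 3) points in the camera frame
```

A per-object semantic feature could then be attached to every returned 3D point, which is one plausible reading of "back-projects them to 3D space".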

Abstract

State-of-the-art 3D point cloud registration methods rely on labeled 3D datasets for training, which limits their practical applications in real-world scenarios and often hinders generalization to unseen scenes. Leveraging the zero-shot capabilities of foundation models offers a promising solution to these challenges. In this paper, we introduce ZeroReg, a zero-shot registration approach that utilizes 2D foundation models to predict 3D correspondences. Specifically, ZeroReg adopts an object-to-point matching strategy, starting with object localization and semantic feature extraction from multi-view images using foundation models. In the object matching stage, semantic features help identify correspondences between objects across views. However, relying solely on semantic features can lead to ambiguity, especially in scenes with multiple instances of the same category. To address this, we construct scene graphs to capture spatial relationships among objects and apply a graph matching algorithm to these graphs to accurately identify matched objects. Finally, computing fine-grained point-level correspondences within matched object regions using algorithms like SuperGlue and LoFTR achieves robust point cloud registration. Evaluations on benchmarks such as 3DMatch, 3DLoMatch, and ScanNet demonstrate ZeroReg's competitive performance, highlighting its potential to advance point-cloud registration by integrating semantic features from foundation models.
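
To make the ambiguity argument concrete, the sketch below matches objects by combining cosine similarity of semantic features with consistency of scene-graph edges (here, pairwise distances between object centroids, which a rigid motion preserves). It is a toy illustration only: the brute-force permutation search stands in for whatever graph matching algorithm the paper uses, and the function name `match_objects`, the weight `lam`, and the equal-object-count assumption are hypothetical.

```python
import itertools
import numpy as np

def match_objects(feat_src, feat_tgt, cen_src, cen_tgt, lam=1.0):
    """Toy object matching: semantic similarity plus edge consistency.

    feat_*: (n, d) per-object semantic features
    cen_*:  (n, 3) per-object 3D centroids
    Assumes both scenes contain the same small number of objects n,
    because the search below enumerates all n! assignments.
    """
    fs = feat_src / np.linalg.norm(feat_src, axis=1, keepdims=True)
    ft = feat_tgt / np.linalg.norm(feat_tgt, axis=1, keepdims=True)
    sim = fs @ ft.T                                   # cosine similarity matrix
    # Scene-graph edges: pairwise centroid distances, invariant to rigid motion.
    d_src = np.linalg.norm(cen_src[:, None] - cen_src[None], axis=-1)
    d_tgt = np.linalg.norm(cen_tgt[:, None] - cen_tgt[None], axis=-1)
    n = len(feat_src)
    best, best_score = None, -np.inf
    for perm in itertools.permutations(range(n)):
        p = list(perm)
        semantic = sim[np.arange(n), p].sum()
        edge_err = np.abs(d_src - d_tgt[np.ix_(p, p)]).sum()
        score = semantic - lam * edge_err
        if score > best_score:
            best, best_score = p, score
    return best   # best[i] is the target index matched to source object i
```

With two identical chairs and one table, the semantic term alone cannot tell the chairs apart; the edge term breaks the tie because each chair sits at a different distance from the table.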

Paper Structure (17 sections, 5 equations, 3 figures, 5 tables)

Introduction
Related Work
Single-Modal Methods
Multi-Modal Methods
Vision Understanding via Foundation Models.
Method
Problem Statement
Object Localization and Feature Extraction
Graph-Based Object-Level Matching
Semantic-Guided Point-Level Matching
Experiments
Experiments Setup
Comparisons on 3DMatch & 3DLoMatch
Comparison on ScanNet
Ablation Studies & Analysis
...and 2 more sections

Figures (3)

Figure 1: Key Motivation. (1) We achieve zero-shot PCR by identifying matched objects based on semantic similarity using CLIP (Radford et al., 2021), which is pretrained on a large-scale image-text dataset. (2) An object-centric scene graph is constructed to capture spatial relationships within the point clouds, resolving semantic ambiguities caused by multiple instances of the same category. Notably, our approach does not require additional 3D data training.
Figure 2: The ZeroReg framework begins by detecting and segmenting objects in the source and target multi-view images using visual foundation models, alongside geometric feature extraction through feature matching algorithms. The generated multi-view object masks are then filtered and averaged, with semantic features extracted using the CLIP text encoder in parallel with geometric feature processing, before being projected onto the point cloud. A scene graph is constructed to facilitate object-level and point-level matching, establishing correspondences between the source and target. Based on these correspondences, the transformation is estimated and refined with RANSAC to achieve precise registration.
Figure 3: Visualization of the entire process in ZeroReg on 3DMatch. For object-level matching, matched regions are highlighted in the same color. For point-level matching, blue and yellow points are used to visualize the correspondences.
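
The final stage described in the Figure 2 caption, computing a transformation from correspondences and using RANSAC, can be sketched as follows. This is a generic illustration, assuming 3D-3D point correspondences are already available: it uses the Kabsch (SVD-based) least-squares fit inside a basic RANSAC loop, and the function names, iteration count, and inlier threshold are hypothetical rather than the paper's settings.

```python
import numpy as np

def kabsch(src, tgt):
    """Least-squares rigid transform (R, t) such that tgt ~= src @ R.T + t."""
    cs, ct = src.mean(axis=0), tgt.mean(axis=0)
    H = (src - cs).T @ (tgt - ct)                 # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))        # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, ct - R @ cs

def ransac_rigid(src, tgt, iters=200, thresh=0.05, seed=0):
    """Basic RANSAC over putative correspondences src[i] <-> tgt[i]."""
    rng = np.random.default_rng(seed)
    best_inliers = None
    for _ in range(iters):
        idx = rng.choice(len(src), size=3, replace=False)   # minimal sample
        R, t = kabsch(src[idx], tgt[idx])
        err = np.linalg.norm(src @ R.T + t - tgt, axis=1)
        inliers = err < thresh
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    if best_inliers.sum() < 3:
        raise ValueError("too few inliers to estimate a rigid transform")
    # Refit on all inliers of the best hypothesis.
    return kabsch(src[best_inliers], tgt[best_inliers])
```

Three non-collinear correspondences are the minimal sample for a rigid 3D transform, which is why each hypothesis is fitted on three pairs before being scored on the full set.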
