RegionPLC: Regional Point-Language Contrastive Learning for Open-World 3D Scene Understanding

Jihan Yang, Runyu Ding, Weipeng Deng, Zhe Wang, Xiaojuan Qi

TL;DR

RegionPLC tackles open-world 3D scene understanding by constructing dense region-level 3D–language pairs from multiple 2D vision-language foundation models. It introduces SFusion to merge diverse region captions and a region-aware point-discriminative contrastive loss to learn robust, discriminative 3D representations from language supervision. The method achieves state-of-the-art gains on ScanNet, ScanNet200, and nuScenes, particularly in unseen categories, while remaining scalable and resource-efficient. Furthermore, RegionPLC can integrate with large language models (RegionGR) for open-ended grounded 3D reasoning without task-specific 3D data, highlighting practical impact for real-world open-world perception and reasoning.

Abstract

We propose a lightweight and scalable Regional Point-Language Contrastive learning framework, namely RegionPLC, for open-world 3D scene understanding, aiming to identify and recognize open-set objects and categories. Specifically, based on our empirical studies, we introduce a 3D-aware SFusion strategy that fuses 3D vision-language pairs derived from multiple 2D foundation models, yielding high-quality, dense region-level language descriptions without human 3D annotations. Subsequently, we devise a region-aware point-discriminative contrastive learning objective to enable robust and effective 3D learning from dense regional language supervision. We carry out extensive experiments on ScanNet, ScanNet200, and nuScenes datasets, and our model outperforms prior 3D open-world scene understanding approaches by an average of 17.2% and 9.1% for semantic and instance segmentation, respectively, while maintaining greater scalability and lower resource demands. Furthermore, our method has the flexibility to be effortlessly integrated with language models to enable open-ended grounded 3D reasoning without extra task-specific training. Code is available at https://github.com/CVMI-Lab/PLA.
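To make the idea of point-level contrastive supervision from regional captions more concrete, here is a minimal sketch in plain Python. It is not the paper's formulation: the function name `point_caption_loss`, the temperature value, and the convention that `-1` marks a point without a caption are all assumptions made for illustration. Each point that falls inside a captioned region is pulled toward its caption embedding and pushed away from the other captions through a softmax cross-entropy over cosine similarities.

```python
import math


def normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def point_caption_loss(point_feats, caption_embs, point_to_caption, temperature=0.1):
    """Illustrative point-to-caption contrastive loss (hypothetical helper).

    point_feats:      list of per-point feature vectors
    caption_embs:     list of caption (text) embedding vectors
    point_to_caption: for each point, the index of its region caption,
                      or -1 if the point has no caption (assumed convention)
    """
    captions = [normalize(c) for c in caption_embs]
    losses = []
    for feat, target in zip(point_feats, point_to_caption):
        if target < 0:
            continue  # points without language supervision are skipped
        f = normalize(feat)
        # Cosine similarity to every caption, scaled by the temperature.
        logits = [sum(a * b for a, b in zip(f, c)) / temperature for c in captions]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_z - logits[target])  # cross-entropy for this point
    return sum(losses) / len(losses) if losses else 0.0


# Toy usage: three captions, three points, one point without a caption.
captions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
points = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.5, 0.5, 0.5]]
loss = point_caption_loss(points, captions, [0, 1, -1])
```

In this sketch the loss is computed per point rather than on a feature averaged over the region, which is one plausible reading of "point-discriminative"; the abstract itself does not specify the exact objective.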

Paper Structure (28 sections, 5 equations, 6 figures, 14 tables)

Figures (6)

  • Figure 1: Overview of our regional point-language contrastive learning framework. For regional 3D-language association, we develop a 3D-aware SFusion strategy that effectively combines 3D vision-language pairs obtained from multiple 2D foundation models (refer to Sec. \ref{sec:region_prompted_language}). Building on these 3D-language data, we propose region-aware point-discriminative contrastive learning to facilitate more distinctive and robust representation learning (detailed in Sec. \ref{sec:pdc_loss}). Different point and box colors in the bottom-right indicate different 3D-caption pairs.
  • Figure 2: Comparison of different approaches for extracting regional language descriptions with 2D foundation models.
  • Figure 3: Qualitative results of our RegionPLC. The examples above show annotation-free open-world scene parsing where no human annotation is available (see (a)), and base-annotated open-world learning where a limited number of base classes are annotated (see (b), (c), (d)) for semantic and instance segmentation covering both indoor and outdoor scenarios. Unseen categories are highlighted in colors.
  • Figure 4: (a) Visualizations of RegionGR, which integrates an LLM for open-ended grounded 3D reasoning. (b) Further examples of RegionGR answering user queries, demonstrating its versatility.
  • Figure 5: Qualitative results of annotation-free semantic segmentation on ScanNet.
  • ...and 1 more figures