
Generalized Robot 3D Vision-Language Model with Fast Rendering and Pre-Training Vision-Language Alignment

Kangcheng Liu, Yong-Jin Liu, Baoquan Chen

TL;DR

WS3D++ tackles open-vocabulary and data-efficient 3D scene parsing by coupling hierarchical vision-language pre-training with region-aware fine-tuning. It uses multi-view rendering to establish explicit 2D-3D associations and distills knowledge from vision-language models into a 3D backbone via a KL-divergence loss, while employing region-level energy-based and contrastive losses for unlabeled data. The approach delivers state-of-the-art open-world and data-efficient performance on indoor and outdoor benchmarks for semantic and instance segmentation and object detection, validating strong cross-task generalization. By enabling language-driven, open-vocabulary 3D perception with reduced labeling requirements, WS3D++ offers practical benefits for robotics and perception systems operating in diverse environments.
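The KL-divergence distillation step mentioned above can be illustrated with a minimal sketch. It assumes the teacher (a vision-language model) and the student (the 3D backbone) each produce per-point logits over the same set of text categories. The function names, the temperature parameter, and the KL(teacher || student) direction are illustrative assumptions, not details confirmed by the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Mean KL(teacher || student) over points (hypothetical helper).

    teacher_logits, student_logits: lists of per-point logit lists,
    one logit per category. The direction of the KL term is an assumption.
    """
    losses = []
    for t, s in zip(teacher_logits, student_logits):
        p = softmax(t, temperature)  # teacher distribution (target)
        q = softmax(s, temperature)  # student distribution (prediction)
        losses.append(kl_divergence(p, q))
    return sum(losses) / len(losses)


if __name__ == "__main__":
    teacher = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
    student = [[-1.0, 0.5, 2.0], [1.5, 0.1, 0.3]]
    print(distillation_loss(teacher, student))
```

The loss is zero when the student reproduces the teacher's distribution at every point and grows as the two distributions diverge, which is what pushes the 3D features toward the vision-language embedding space.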

Abstract

Deep neural network models have achieved remarkable progress in 3D scene understanding when trained in a closed-set setting with full labels. However, the major bottleneck is that these models cannot recognize unseen novel classes beyond the training categories in diverse real-world applications. A framework is therefore needed that applies to both 3D point cloud segmentation and detection, particularly when labels are scarce. This work presents a generalized and straightforward framework for 3D scene understanding when labeled scenes are limited. To extract knowledge of novel categories from pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy that distills meaningful information from large-scale vision-language models, which benefits open-vocabulary scene understanding tasks. To encourage latent instance discrimination and to guarantee efficiency, we propose an unsupervised region-level semantic contrastive learning scheme for point clouds, which uses confident predictions of the neural network to discriminate the intermediate feature embeddings at multiple stages. In the limited reconstruction case, our proposed approach, termed WS3D++, ranks 1st on the large-scale ScanNet benchmark for both semantic segmentation and instance segmentation. Extensive experiments on both indoor and outdoor scenes demonstrate the effectiveness of our approach in both data-efficient learning and open-world few-shot learning. The code is publicly available at: https://drive.google.com/drive/folders/1M58V-PtR8DBEwD296zJkNg_m2qq-MTAP?usp=sharing.
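The region-level contrastive scheme driven by confident predictions can be sketched as follows. This is a minimal illustration under stated assumptions: each region has one embedding and one predicted class distribution, regions whose top probability passes a threshold receive a pseudo-label, and an InfoNCE-style loss pulls together regions sharing a pseudo-label. The threshold, temperature, loss form, and function names are hypothetical and not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def confident_pseudo_labels(probs, threshold=0.8):
    """Return (region_index, label) pairs whose top probability >= threshold."""
    kept = []
    for i, p in enumerate(probs):
        label = max(range(len(p)), key=lambda c: p[c])
        if p[label] >= threshold:
            kept.append((i, label))
    return kept


def region_contrastive_loss(embeddings, probs, threshold=0.8, tau=0.1):
    """InfoNCE-style loss over confident regions (assumed form)."""
    kept = confident_pseudo_labels(probs, threshold)
    losses = []
    for i, label_i in kept:
        pos, total = 0.0, 0.0
        for j, label_j in kept:
            if i == j:
                continue
            weight = math.exp(cosine(embeddings[i], embeddings[j]) / tau)
            total += weight
            if label_i == label_j:
                pos += weight  # same pseudo-label: treated as a positive
        if pos > 0.0:  # skip anchors that have no positive partner
            losses.append(-math.log(pos / total))
    return sum(losses) / len(losses) if losses else 0.0


if __name__ == "__main__":
    probs = [[0.9, 0.1], [0.95, 0.05], [0.1, 0.9], [0.05, 0.95], [0.55, 0.45]]
    emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.5, 0.5]]
    print(region_contrastive_loss(emb, probs))
```

Low-confidence regions, such as the last one in the usage example, are simply left out of the loss, so noisy predictions do not act as positives or negatives.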

Paper Structure (15 sections, 4 equations, 12 figures, 6 tables)

This paper contains 15 sections, 4 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: Overview of our proposed WS3D++. We integrate language-3D feature-associated pre-training and data-efficient fine-tuning into a general scene-parsing vision-language model to achieve effective data-efficient and open-vocabulary 3D scene understanding.
  • Figure 2: The pre-training paradigm of our proposed WS3D++. We propose hierarchical global-to-local feature alignments to establish hierarchical vision-language aligned feature representations during pre-training. This paradigm helps to learn more powerful visual-linguistic aligned feature representations during the pre-training stage. We further show the final visualizations and comparisons with CLIP text representations, ranging from the global view level to the local object category level. The results further demonstrate the vision-language aligned feature representation for 3D scene parsing.
  • Figure 3: The feature matching visualization of our proposed WS3D++. We propose hierarchical global-to-local feature alignments to establish hierarchical vision-language aligned feature representations during pre-training, ranging from the global view level to the local object category level. The figure shows the matching at the global view level on the left and the matching at the local object level on the right. Our proposed approach establishes matched feature representations at both the global room feature level and the local object feature level.
  • Figure 4: The fine-tuning paradigm of our proposed WS3D++. WS3D++ [liu2022weakly] consists of three proposed modules: 1. unsupervised region-level energy-based optimization guided by boundary labels; 2. unsupervised multi-stage region-level contrastive learning with high confidence; 3. supervised region-level semantic contrastive learning with labeled data. The backbone network adopts an encoder-decoder structure, and its weights are shared between the supervised and unsupervised branches. Combined with the pre-training paradigm illustrated in Figure 2, the proposed hierarchical feature-aligned pre-training and regional fine-tuning realize more effective label-efficient and open-vocabulary learning.
  • Figure 6: The captured 2D and 3D region proposals. Qualitatively, the proposed united 2D/3D proposal generation approach captures more precise object proposals.
  • ...and 7 more figures