
Object2Scene: Putting Objects in Context for Open-Vocabulary 3D Detection

Chenming Zhu, Wenwei Zhang, Tai Wang, Xihui Liu, Kai Chen

TL;DR

Open-vocabulary 3D object detection in point clouds is hindered by scarce 3D annotations and by the extra 2D-3D alignment step that image-based approaches require. Object2Scene augments 3D scenes with objects from large-vocabulary 3D object datasets and generates language grounding prompts for them, while L3Det unifies 3D detection and grounding in a transformer-based framework and is trained with a cross-domain category-level contrastive loss. Training optimizes a combination of three losses, $\mathcal{L}_{loc}$, $\mathcal{L}_{align}$, and $\mathcal{L}_{cl}$, where the contrastive term is $\mathcal{L}_{cl} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\sum_{j\neq i,\, y_j=y_i}\exp(f_i^\top f_j/\tau)}{\sum_{k\neq i}\exp(f_i^\top f_k/\tau)}$, with $j$ and $k$ ranging over the $N$ objects, $f_i$ the feature of object $i$, $y_i$ its category label, and $\tau$ a temperature. Experiments on OV-ScanNet20, OV-SUN RGB-D20, and OV-ScanNet200 show state-of-the-art open-vocabulary 3D detection and strong cross-dataset transfer, supporting the practicality of end-to-end open-world 3D perception without relying on 2D image correspondences.
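The contrastive term $\mathcal{L}_{cl}$ can be read as: for each object, the similarity mass on same-category objects divided by the similarity mass on all other objects. The following is a minimal pure-Python sketch of that formula, not the authors' implementation. The function name, the default `tau=0.1`, and the choice to skip objects that have no same-category partner (where the logarithm would otherwise be undefined) are assumptions made for illustration.

```python
import math


def category_contrastive_loss(features, labels, tau=0.1):
    """Category-level contrastive loss over N object features.

    features: list of N equal-length feature vectors (assumed L2-normalized).
    labels:   list of N category ids; objects sharing an id are positives,
              regardless of which dataset they came from.
    tau:      temperature (0.1 is an illustrative value, not from the paper).
    """
    n = len(features)
    if n == 0:
        return 0.0
    total = 0.0
    for i in range(n):
        # exp(f_i . f_k / tau) for every other object k != i (the denominator terms)
        sims = {
            k: math.exp(sum(a * b for a, b in zip(features[i], features[k])) / tau)
            for k in range(n)
            if k != i
        }
        # Numerator: only objects with the same category as object i
        pos = sum(s for k, s in sims.items() if labels[k] == labels[i])
        if pos == 0.0:
            continue  # assumption: objects without a positive contribute nothing
        total += -math.log(pos / sum(sims.values()))
    return total / n
```

With two aligned "chair" features and one orthogonal "table" feature, the loss is lower than when the labels are shuffled so that same-category features point in different directions, which is the behavior the loss is meant to reward.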

Abstract

Point cloud-based open-vocabulary 3D object detection aims to detect 3D categories that do not have ground-truth annotations in the training set. It is extremely challenging because of the limited data and annotations (bounding boxes with class labels or text descriptions) of 3D scenes. Previous approaches leverage large-scale richly-annotated image datasets as a bridge between 3D and category semantics but require an extra alignment process between 2D images and 3D points, limiting the open-vocabulary ability of 3D detectors. Instead of leveraging 2D images, we propose Object2Scene, the first approach that leverages large-scale large-vocabulary 3D object datasets to augment existing 3D scene datasets for open-vocabulary 3D object detection. Object2Scene inserts objects from different sources into 3D scenes to enrich the vocabulary of 3D scene datasets and generates text descriptions for the newly inserted objects. We further introduce a framework that unifies 3D detection and visual grounding, named L3Det, and propose a cross-domain category-level contrastive learning approach to mitigate the domain gap between 3D objects from different datasets. Extensive experiments on existing open-vocabulary 3D object detection benchmarks show that Object2Scene obtains superior performance over existing methods. We further verify the effectiveness of Object2Scene on a new benchmark OV-ScanNet-200, by holding out all rare categories as novel categories not seen during training.
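The object-insertion idea described in the abstract can be sketched in a few lines. This is a hedged illustration only: the function name `insert_object`, the uniform random placement inside the scene footprint, the flat floor at `floor_z`, and the placeholder prompt template are all assumptions, and the sketch does no collision checking. It shows just the data flow of adding an object's points to a scene, recording its bounding box, and attaching a text description.

```python
import random


def insert_object(scene_points, object_points, category, floor_z=0.0, rng=None):
    """Place an object point cloud into a scene and describe it.

    scene_points / object_points: lists of (x, y, z) tuples.
    Returns (augmented_points, box, prompt), where box is an axis-aligned
    (min_xyz, max_xyz) pair around the inserted object.
    """
    rng = rng or random.Random()
    xs, ys, _ = zip(*scene_points)
    # Hypothetical placement rule: uniform x/y inside the scene footprint.
    tx = rng.uniform(min(xs), max(xs))
    ty = rng.uniform(min(ys), max(ys))

    # Center the object in x/y and rest its lowest point on the floor.
    ox, oy, oz = zip(*object_points)
    cx = (min(ox) + max(ox)) / 2
    cy = (min(oy) + max(oy)) / 2
    dz = floor_z - min(oz)
    placed = [(x - cx + tx, y - cy + ty, z + dz) for x, y, z in object_points]

    # Axis-aligned box of the inserted object serves as its detection label.
    px, py, pz = zip(*placed)
    box = ((min(px), min(py), min(pz)), (max(px), max(py), max(pz)))

    prompt = f"a {category} on the floor"  # placeholder template, not the paper's
    return list(scene_points) + placed, box, prompt
```

Passing a seeded `random.Random` as `rng` makes the placement reproducible, which is convenient when inspecting augmented scenes.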

Paper Structure

This paper contains 29 sections, 1 equation, 7 figures, and 9 tables.

Figures (7)

  • Figure 1: By utilizing the 3D object datasets, Object2Scene empowers the 3D detector (L3Det) with strong open-vocabulary capability. The training process is shown on the left, where Object2Scene generates training data for L3Det by inserting unseen objects into the 3D scene and generating grounding prompts for the inserted objects. The inference process is shown on the right.
  • Figure 2: Overall pipeline for Object2Scene. The objects are sampled from 3D object datasets and inserted into the real-scanned scene. Then we generate grounding prompts for the inserted objects.
  • Figure 3: Open-vocabulary 3D object detection with Object2Scene. The figure on the left-hand side shows the model architecture of L3Det. The figure on the right-hand side shows the cross-domain category-level contrastive learning approach. Given 6 objects illustrated in the figure, the contrastive loss brings together (denoted by "1" in the matrix) the features of the three objects belonging to the category "chair", despite the fact that they are from different source datasets. The object features of different categories are pushed away from each other (denoted by "0" in the matrix).
  • Figure 4: Qualitative results for open-vocabulary 3D object detection. For each scene, the detection prompt is shown under the input point cloud. The colors of the bounding boxes correspond to the classes in the prompts.
  • Figure 5: Sentence decoupling illustration.
  • ...and 2 more figures