Shape2Scene: 3D Scene Representation Learning Through Pre-training on Shape Data

Tuo Feng, Wenguan Wang, Ruijie Quan, Yi Yang

TL;DR

Shape2Scene (S2S) addresses the data desert in self-supervised learning (SSL) for 3D scenes by pre-training on abundant 3D shape data and transferring to scene-level tasks. It introduces multi-scale high-resolution backbones, MH-P (point-based) and MH-V (voxel-based); a Shape-to-Scene strategy that composes pseudo scenes from multiple shapes; and a point-point contrastive loss (PPC) for learning aligned high-resolution representations. The method reports strong transfer on shape-level tasks (ModelNet40 OA 94.6%, ScanObjectNN OA 93.8%, ShapeNetPart Inst. mIoU 87.6%) and scene-level tasks (S3DIS mIoU 74.1%, ScanNet v2 75.8%, SemanticKITTI 71.5%, Synthia4D 84.2%, and 3D detection mAP@0.5 43.9% with MH-VH). Because shape data is cheaper to collect than scene data, the approach lowers the data collection cost of 3D scene understanding.

Abstract

Current self-supervised learning methods for 3D scenes face a data desert issue, resulting from the time-consuming and expensive collection of 3D scene data. Conversely, 3D shape datasets are easier to collect. Despite this, existing pre-training strategies on shape data offer limited potential for 3D scene understanding due to significant disparities in point quantities. To tackle these challenges, we propose Shape2Scene (S2S), a novel method that learns representations of large-scale 3D scenes from 3D shape data. We first design multi-scale and high-resolution backbones for shape- and scene-level 3D tasks, i.e., MH-P (point-based) and MH-V (voxel-based). MH-P/V establish direct paths to high-resolution features that capture deep semantic information across multiple scales. This property makes them suitable for a wide range of 3D downstream tasks that rely heavily on high-resolution features. We then employ a Shape-to-Scene strategy (S2SS) to amalgamate points from various shapes, creating a random pseudo scene (comprising multiple objects) as training data and mitigating disparities between shapes and scenes. Finally, a point-point contrastive loss (PPC) is applied for the pre-training of MH-P/V. In PPC, the inherent correspondence (i.e., point pairs) is naturally obtained in S2SS. Extensive experiments demonstrate the transferability of the 3D representations learned by MH-P/V across shape-level and scene-level 3D tasks. MH-P achieves notable performance on well-known point cloud datasets (93.8% OA on ScanObjectNN and 87.6% instance mIoU on ShapeNetPart). MH-V also achieves promising performance in 3D semantic segmentation and 3D object detection.
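
To make the point-point contrastive idea concrete, the following is a minimal sketch of a generic InfoNCE-style loss over matched point pairs. It is not the paper's exact formulation: the function name `point_info_nce`, the cosine normalization, and the temperature value of 0.07 are all assumptions made for illustration. The sketch assumes two feature matrices whose rows are already matched, so row i of one matrix and row i of the other form a positive pair and every other row serves as a negative.

```python
import numpy as np


def point_info_nce(feats_a, feats_b, temperature=0.07):
    """InfoNCE over matched rows (hypothetical helper, not the paper's code).

    feats_a[i] and feats_b[i] form a positive pair; all other rows of
    feats_b are treated as negatives for feats_a[i].
    """
    # L2-normalize so the dot product is a cosine similarity.
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)

    # Pairwise similarity matrix, scaled by the temperature (assumed value).
    logits = (a @ b.T) / temperature

    # Subtract the row-wise maximum for numerical stability.
    logits = logits - logits.max(axis=1, keepdims=True)

    # Row-wise log-softmax; the diagonal holds the positive pairs.
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(64, 16))
    noisy = feats + 0.01 * rng.normal(size=feats.shape)
    print("matched pairs:   ", point_info_nce(feats, noisy))
    print("mismatched pairs:", point_info_nce(feats, np.roll(noisy, 1, axis=0)))
```

With correctly matched pairs the loss should be small, and it should grow when the correspondence is broken, which is the signal that drives the features of corresponding points together.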

Paper Structure (23 sections, 1 equation, 3 figures, 9 tables, 1 algorithm)

Figures (3)

  • Figure 1: Illustration for transferring from shape data to scene-level downstream tasks, i.e., Shape2Scene. The Shape-to-Scene strategy aggregates 4 (=$M$) shapes to one pseudo scene. Each shape is resampled and rescaled to fit onto a unit sphere. Blue scores show maximum improvements relative to training-from-scratch models. (S-KITTI stands for SemanticKITTI.)
  • Figure 2: (a) The overview of the MH-P backbone during pre-training, with contrastive loss (Eq. (\ref{eq:PointInfoNCE})). (b) The Multi-scale High-resolution (MH) Module of MH-P (see § \ref{sec:shape1} for details). (c) The overview of the MH-V backbone during pre-training, with contrastive loss (Eq. (\ref{eq:PointInfoNCE})). (d) The Multi-scale High-resolution (MH) Module of MH-V (see § \ref{sec:shape2} for details).
  • Figure A1: Illustration for transferring from shape data to scene-level downstream tasks, i.e., Shape2Scene (§ \ref{appendix3}). (a) shows the pre-training of the MH-V backbone for the semantic segmentation task; (b) shows the downstream semantic segmentation tasks; (c) shows the pre-training of the MH-V backbone for the object detection task; and (d) shows the downstream object detection tasks. (Best viewed with zoom-in.)
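
The Figure 1 caption describes the Shape-to-Scene strategy as aggregating M = 4 shapes into one pseudo scene, with each shape resampled and rescaled to fit onto a unit sphere. A rough sketch of that composition step might look as follows. Only the resampling, the unit-sphere rescaling, and the aggregation of M shapes come from the caption; the grid placement, the `spacing` value, the per-shape point count, and all function names are hypothetical choices made for illustration.

```python
import numpy as np


def normalize_to_unit_sphere(points):
    """Center a shape and scale it so its farthest point lies on the unit sphere."""
    centered = points - points.mean(axis=0)
    radius = np.linalg.norm(centered, axis=1).max()
    return centered / max(radius, 1e-12)


def resample(points, n, rng):
    """Randomly pick exactly n points (with replacement only if the shape has fewer)."""
    idx = rng.choice(len(points), size=n, replace=len(points) < n)
    return points[idx]


def compose_pseudo_scene(shapes, n_per_shape=1024, spacing=2.0, seed=0):
    """Merge several shapes into one pseudo scene.

    Assumption: shapes are placed on a square grid in the xy-plane so that
    their unit spheres do not overlap. The source does not specify placement.
    Returns the scene points and, per point, the index of its source shape.
    """
    rng = np.random.default_rng(seed)
    cols = int(np.ceil(np.sqrt(len(shapes))))
    scene, shape_ids = [], []
    for i, shape in enumerate(shapes):
        pts = normalize_to_unit_sphere(resample(shape, n_per_shape, rng))
        offset = np.array([(i % cols) * spacing, (i // cols) * spacing, 0.0])
        scene.append(pts + offset)
        shape_ids.append(np.full(n_per_shape, i))
    return np.concatenate(scene), np.concatenate(shape_ids)


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    # Four synthetic "shapes" with different point counts (M = 4).
    shapes = [rng.normal(size=(500 + 100 * i, 3)) for i in range(4)]
    scene, ids = compose_pseudo_scene(shapes, n_per_shape=256)
    print(scene.shape, ids.shape)
```

Keeping the per-point shape index alongside the scene makes it straightforward to trace each scene point back to its source shape, which is one plausible way to obtain point correspondences for a contrastive objective.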