Open-Vocabulary Octree-Graph for 3D Scene Understanding

Zhigang Wang, Yifei Su, Chenhui Li, Dong Wang, Yan Huang, Bin Zhao, Xuelong Li

TL;DR

The paper proposes Octree-Graph, a novel scene representation for open-vocabulary 3D scene understanding. It uses an adaptive-octree structure that stores an object's semantics and depicts its occupancy at a granularity adapted to the object's shape.

Abstract

Open-vocabulary 3D scene understanding is indispensable for embodied agents. Recent works leverage pretrained vision-language models (VLMs) for object segmentation and project the results onto point clouds to build 3D maps. Despite this progress, a point cloud is a set of unordered coordinates that requires substantial storage space and does not directly convey occupancy information or spatial relations, making existing methods inefficient for downstream tasks such as path planning and complex text-based object retrieval. To address these issues, we propose Octree-Graph, a novel scene representation for open-vocabulary 3D scene understanding. Specifically, a Chronological Group-wise Segment Merging (CGSM) strategy and an Instance Feature Aggregation (IFA) algorithm are first designed to obtain 3D instances and their corresponding semantic features. Subsequently, an adaptive-octree structure is developed that stores semantics and depicts the occupancy of an object adjustably according to its shape. Finally, the Octree-Graph is constructed, in which each adaptive-octree acts as a graph node and edges describe the spatial relations among nodes. Extensive experiments on various tasks are conducted on several widely used datasets, demonstrating the versatility and effectiveness of our method.
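To make the data structure described in the abstract more concrete, the following is a minimal Python sketch of an octree that subdivides only occupied cells, plus a graph whose nodes are such octrees. It is an illustration under stated assumptions, not the paper's construction: the stopping rule (`max_depth`, `min_points`), the names `OctreeNode`, `build_octree`, `occupied`, and `build_graph`, and the choice of a center-offset vector as the edge payload are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float, float]


@dataclass
class OctreeNode:
    center: Point
    half: float  # half of the cube's edge length
    children: List["OctreeNode"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children


def build_octree(points: List[Point], center: Point, half: float,
                 max_depth: int, min_points: int = 1) -> Optional[OctreeNode]:
    """Return a node for an occupied cube, or None if no point falls in it.

    Assumed stopping rule: stop at max_depth or when the cube holds at most
    min_points points. Empty octants are never created, so the tree depth
    follows the object's shape.
    """
    if not points:
        return None
    node = OctreeNode(center, half)
    if max_depth == 0 or len(points) <= min_points:
        return node
    # Partition the points into the eight octants around the cube center.
    buckets: List[List[Point]] = [[] for _ in range(8)]
    for p in points:
        idx = sum(int(p[i] >= center[i]) << i for i in range(3))
        buckets[idx].append(p)
    q = half / 2
    for idx, bucket in enumerate(buckets):
        child_center = tuple(
            center[i] + (q if (idx >> i) & 1 else -q) for i in range(3)
        )
        child = build_octree(bucket, child_center, q, max_depth - 1, min_points)
        if child is not None:
            node.children.append(child)
    return node


def count_leaves(node: OctreeNode) -> int:
    if node.is_leaf():
        return 1
    return sum(count_leaves(c) for c in node.children)


def occupied(node: OctreeNode, p: Point) -> bool:
    """True if p lies inside some leaf cube of the tree."""
    if any(abs(p[i] - node.center[i]) > node.half for i in range(3)):
        return False
    if node.is_leaf():
        return True
    return any(occupied(c, p) for c in node.children)


def build_graph(objects: List[OctreeNode]) -> Dict[Tuple[int, int], Point]:
    """Hypothetical edge payload: offset from object i's center to object j's."""
    edges: Dict[Tuple[int, int], Point] = {}
    for i, a in enumerate(objects):
        for j, b in enumerate(objects):
            if i < j:
                edges[(i, j)] = tuple(b.center[k] - a.center[k] for k in range(3))
    return edges


if __name__ == "__main__":
    pts = [(0.1, 0.1, 0.1), (0.9, 0.9, 0.9)]
    root = build_octree(pts, (0.5, 0.5, 0.5), 0.5, max_depth=3)
    print(count_leaves(root), occupied(root, (0.2, 0.2, 0.2)))
```

Because only occupied octants are stored, a query such as `occupied(root, p)` can answer a free-space check without touching the raw points, which is the kind of occupancy information the abstract says a plain point cloud does not convey directly.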

Paper Structure

This paper contains 15 sections, 3 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: (a) A 3D scene. (b) The corresponding semantic 3D map based on point clouds (6.8M). (c) Our Octree-Graph where each object is represented by the proposed adaptive-octree and each edge contains rich spatial relations among objects. All adaptive-octrees occupy 42KB of storage space in total.
  • Figure 2: Overview of our Octree-Graph. (a) Chronological Group-wise Segment Merging (CGSM). Given posed RGB-D inputs, 2D masks with semantic features are first extracted and then projected into the 3D space where CGSM is conducted to merge segments. (b) Instance Feature Aggregation (IFA). Feature aggregation is performed for each merged object, which considers both intra- and inter-object similarity. (c) The Octree-Graph is constructed to efficiently and accurately represent the scene, facilitating various downstream tasks.
  • Figure 3: Illustration of the nodes and edges in Octree-Graph.
  • Figure 4: Illustration of the construction of the adaptive-octree. The top part displays the process, and the bottom part shows an example.
  • Figure 5: Visual comparisons. (a) Semantic segmentation results on Replica. (b) Instance segmentation results on ScanNet200.
  • ...and 2 more figures