Open-Fusion: Real-time Open-Vocabulary 3D Mapping and Queryable Scene Representation

Kashu Yamazaki, Taisei Hanyu, Khoa Vo, Thang Pham, Minh Tran, Gianfranco Doretto, Anh Nguyen, Ngan Le

TL;DR

Open-Fusion tackles real-time open-vocabulary 3D mapping by integrating region-level embeddings from a vision-language foundation model (VLFM) with a TSDF-based reconstruction pipeline. The method uses SEEM to obtain region-level embeddings and confidence maps, stores the embeddings in a compact dictionary indexed by keys, and employs enhanced Hungarian (Jonker-Volgenant) matching to fuse semantic information into the 3D map. The map is a semantic TSDF volume $V_t = \{G_i\}_{i=1}^M$ consisting of blocks $G_i = \{p_j\}_{j=1}^{r^3}$, where each voxel $p_j$ carries the attributes $(RGB_j, w_j, \phi_j, k_j, c_j)$. Two main modules provide real-time 3D scene reconstruction and open-vocabulary querying, with semantic updates synchronized to the frame stream. On ScanNet, Open-Fusion runs at ~4.5 FPS with competitive mAcc and f-mIoU, up to ~30x faster than the baselines; qualitative results on Replica and a Kobuki-based real-world test show accurate open-vocabulary segmentation and practical applicability. The embedding dictionary and region-level fusion reduce memory and computation, which is what makes real-time, open-world semantic mapping for robotics feasible in this design.
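To make the voxel layout concrete, the following is a minimal Python sketch of a volume $V_t$ made of blocks $G_i$, each holding $r^3$ voxels with the attributes $(RGB_j, w_j, \phi_j, k_j, c_j)$. The geometry update is the standard weighted running average used in TSDF fusion. The semantic rule (keep the key observed with the highest confidence) and every class and function name here are assumptions for illustration, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Voxel:
    """One voxel p_j with attributes (RGB_j, w_j, phi_j, k_j, c_j)."""
    rgb: Tuple[float, float, float] = (0.0, 0.0, 0.0)
    w: float = 0.0            # accumulated TSDF weight
    phi: float = 1.0          # truncated signed distance
    k: Optional[int] = None   # key into the embedding dictionary
    c: float = 0.0            # confidence of the semantic assignment


@dataclass
class Block:
    """One block G_i holding r^3 voxels."""
    r: int
    voxels: List[Voxel] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.voxels:
            self.voxels = [Voxel() for _ in range(self.r ** 3)]


@dataclass
class SemanticVolume:
    """V_t: blocks plus the embedding dictionary that voxel keys index into."""
    blocks: Dict[Tuple[int, int, int], Block] = field(default_factory=dict)
    embeddings: Dict[int, List[float]] = field(default_factory=dict)


def fuse_geometry(v: Voxel, sdf: float, rgb, obs_w: float = 1.0,
                  trunc: float = 1.0) -> None:
    """Weighted running-average TSDF and color update for one observation."""
    d = max(-trunc, min(trunc, sdf))  # truncate the signed distance
    new_w = v.w + obs_w
    v.phi = (v.phi * v.w + d * obs_w) / new_w
    v.rgb = tuple((a * v.w + b * obs_w) / new_w for a, b in zip(v.rgb, rgb))
    v.w = new_w


def fuse_semantics(v: Voxel, key: int, conf: float) -> None:
    """Assumed rule: overwrite the key only when the new confidence is higher."""
    if conf > v.c:
        v.k, v.c = key, conf
```

Storing an integer key per voxel rather than a full embedding vector is what keeps the per-voxel footprint small; the embedding itself lives once in `SemanticVolume.embeddings`.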

Abstract

Precise 3D environmental mapping is pivotal in robotics. Existing methods often rely on predefined concepts during training or are time-intensive when generating semantic maps. This paper presents Open-Fusion, a groundbreaking approach for real-time open-vocabulary 3D mapping and queryable scene representation using RGB-D data. Open-Fusion harnesses the power of a pre-trained vision-language foundation model (VLFM) for open-set semantic comprehension and employs the Truncated Signed Distance Function (TSDF) for swift 3D scene reconstruction. By leveraging the VLFM, we extract region-based embeddings and their associated confidence maps. These are then integrated with 3D knowledge from TSDF using an enhanced Hungarian-based feature-matching mechanism. Notably, Open-Fusion delivers outstanding annotation-free 3D segmentation for open-vocabulary without necessitating additional 3D training. Benchmark tests on the ScanNet dataset against leading zero-shot methods highlight Open-Fusion's superiority. Furthermore, it seamlessly combines the strengths of region-based VLFM and TSDF, facilitating real-time 3D scene comprehension that includes object concepts and open-world semantics. We encourage the readers to view the demos on our project page: https://uark-aicv.github.io/OpenFusion
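The matching step mentioned in the abstract can be pictured as a one-to-one assignment between the region embeddings of the current frame and the embeddings already stored in the map. The sketch below is a hedged illustration only: it scores pairs by cosine similarity and finds the optimal assignment by brute force over permutations, which is a small-input stand-in for a Hungarian or Jonker-Volgenant solver and not the paper's enhanced variant. The threshold `min_sim`, the rule that unmatched regions receive new keys, and all function names are assumptions.

```python
import math
from itertools import permutations


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def match_regions(regions, dictionary, min_sim=0.5):
    """Return {region_index: key} for the one-to-one assignment with the
    highest total similarity, dropping pairs below min_sim."""
    keys = list(dictionary)
    sim = [[cosine(r, dictionary[k]) for k in keys] for r in regions]
    n, m = len(regions), len(keys)
    best_total, best = -math.inf, ()
    if n <= m:
        # Choose a distinct dictionary entry for every region.
        for cols in permutations(range(m), n):
            total = sum(sim[i][j] for i, j in enumerate(cols))
            if total > best_total:
                best_total, best = total, tuple(enumerate(cols))
    else:
        # Choose a distinct region for every dictionary entry.
        for rows in permutations(range(n), m):
            total = sum(sim[i][j] for j, i in enumerate(rows))
            if total > best_total:
                best_total = total
                best = tuple((i, j) for j, i in enumerate(rows))
    return {i: keys[j] for i, j in best if sim[i][j] >= min_sim}


def update_dictionary(regions, dictionary, min_sim=0.5):
    """Match regions to existing keys; unmatched regions get fresh keys."""
    assigned = match_regions(regions, dictionary, min_sim)
    next_key = max(dictionary, default=-1) + 1
    for i, emb in enumerate(regions):
        if i not in assigned:
            dictionary[next_key] = list(emb)
            assigned[i] = next_key
            next_key += 1
    return assigned
```

Brute force grows factorially with the number of regions, so a real pipeline would swap in a polynomial-time assignment solver; the inputs and outputs would stay the same.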

Paper Structure

This paper contains 14 sections, 7 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: The overall pipeline of Open-Fusion, which comprises two modules. Real-time Semantic TSDF 3D Scene Reconstruction Module: this module takes in a stream of RGB-D images ($\textbf{I}_t$, $\textbf{D}_t$) and the corresponding camera pose ($\mathbf{A}_t$). It incrementally reconstructs the 3D scene, representing it as a semantic TSDF volume $V_t$ at time $t$. Open-Vocabulary Query and Scene Understanding Module: this module accepts open-vocabulary queries as input and returns the corresponding scene segmentations, which can serve as the eyes for language-based robot commanding.
  • Figure 2: Qualitative comparison of 3D object query results on the Replica dataset. While ConceptFusion fails to pinpoint the object location, Open-Fusion estimates a more precise location from language queries.
  • Figure 3: The Kobuki platform is equipped with an Azure Kinect camera and an Intel T265 camera to demonstrate real-time mapping in a real-world environment. The system enables interaction with the world through natural language queries and is able to highlight novel objects such as the "quadruped robot" or "chicken taxidermy".
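
The open-vocabulary query module described in the Figure 1 caption can be sketched as a lookup against the embedding dictionary: compare the query embedding with each stored region embedding once, then select the voxels whose key points at a matching entry. This is an illustrative sketch under assumptions, not the authors' code; the text encoder that would produce `text_emb` is outside the sketch, and the function names and the similarity threshold are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def query_voxels(text_emb, dictionary, voxel_keys, threshold=0.5):
    """Return indices of voxels whose stored embedding matches the query.

    dictionary: {key: embedding}; voxel_keys: per-voxel key k_j or None.
    """
    # Similarity is computed once per dictionary entry, not once per voxel.
    hits = {k for k, emb in dictionary.items()
            if cosine(text_emb, emb) >= threshold}
    return [j for j, k in enumerate(voxel_keys) if k in hits]


if __name__ == "__main__":
    dictionary = {0: [1.0, 0.0], 1: [0.0, 1.0]}
    voxel_keys = [0, 1, None, 1]
    print(query_voxels([0.0, 2.0], dictionary, voxel_keys))
```

Because the comparison runs over dictionary entries rather than voxels, the query cost scales with the number of distinct regions in the map instead of the map's volume.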