OpenVox: Real-time Instance-level Open-vocabulary Probabilistic Voxel Representation

Yinan Deng, Bicheng Yao, Yihang Tang, Yi Yang, Yufeng Yue

TL;DR

OpenVox tackles real-time open-vocabulary 3D mapping for robots by integrating a front-end that enriches instance understanding with caption-encoded language reasoning and a back-end that maintains a probabilistic voxel representation. By decomposing cross-frame fusion into instance association and live map evolution, it achieves robust incremental updates without relying on offline processing. The framework demonstrates state-of-the-art performance in 3D zero-shot instance segmentation, 3D zero-shot semantic segmentation, and open-vocabulary retrieval, validated on both synthetic and real-world onboard experiments. This work provides a practical pathway for open-world robotic perception, enabling reliable instance-level semantics in dynamic environments.

Abstract

In recent years, vision-language models (VLMs) have advanced open-vocabulary mapping, enabling mobile robots to simultaneously achieve environmental reconstruction and high-level semantic understanding. While integrated object cognition helps mitigate semantic ambiguity in point-wise feature maps, efficiently obtaining rich semantic understanding and robust incremental reconstruction at the instance level remains challenging. To address these challenges, we introduce OpenVox, a real-time incremental open-vocabulary probabilistic instance voxel representation. In the front-end, we design an efficient instance segmentation and comprehension pipeline that enhances language reasoning by encoding captions. In the back-end, we implement probabilistic instance voxels and decompose the cross-frame incremental fusion process into two subtasks, instance association and live map evolution, ensuring robustness to sensor and segmentation noise. Extensive evaluations across multiple datasets demonstrate that OpenVox achieves state-of-the-art performance in zero-shot instance segmentation, semantic segmentation, and open-vocabulary retrieval. Furthermore, real-world robotics experiments validate OpenVox's capability for stable, real-time operation.
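
To make the idea of a probabilistic instance voxel more concrete, the following is a minimal sketch, not the authors' implementation. It assumes each voxel keeps weighted observation counts per instance ID and normalizes them into probabilities. The class `ProbabilisticVoxelMap`, its methods, and the counting update rule are all hypothetical names and choices made for illustration. The abstract does not specify the actual update rule.

```python
import math
from collections import defaultdict


class ProbabilisticVoxelMap:
    """Toy voxel map (assumed model): each voxel stores weighted votes per instance ID."""

    def __init__(self, voxel_size=0.1):
        self.voxel_size = voxel_size
        # voxel index -> {instance_id: accumulated weight}
        self.votes = defaultdict(lambda: defaultdict(float))

    def voxel_key(self, point):
        # Quantize a 3D point to integer voxel indices.
        return tuple(math.floor(c / self.voxel_size) for c in point)

    def observe(self, point, instance_id, weight=1.0):
        # Assumed update rule: accumulate evidence for an instance in this voxel.
        self.votes[self.voxel_key(point)][instance_id] += weight

    def probabilities(self, point):
        # Normalize accumulated votes into a distribution over instance IDs.
        votes = self.votes.get(self.voxel_key(point), {})
        total = sum(votes.values())
        if total <= 0:
            return {}
        return {inst: w / total for inst, w in votes.items()}

    def label(self, point):
        # Return the most probable instance and its probability (confidence).
        probs = self.probabilities(point)
        if not probs:
            return None, 0.0
        best = max(probs, key=probs.get)
        return best, probs[best]


voxel_map = ProbabilisticVoxelMap(voxel_size=0.1)
voxel_map.observe((0.01, 0.02, 0.03), instance_id=7)
voxel_map.observe((0.04, 0.05, 0.06), instance_id=7)
voxel_map.observe((0.02, 0.02, 0.02), instance_id=3)  # a noisy, conflicting observation
best_instance, confidence = voxel_map.label((0.05, 0.05, 0.05))
```

In this sketch a single conflicting observation lowers the voxel's confidence without changing its label, which is the kind of robustness to segmentation noise the abstract attributes to probabilistic modeling.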

Paper Structure

This paper contains 15 sections, 19 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: We introduce OpenVox, a framework of real-time instance-level open-vocabulary probabilistic voxel representation. OpenVox efficiently and robustly reconstructs instance-level maps. The comparison between rendered and detected masks highlights its effectiveness in associating instances across frames (yellow lines) and mitigating missing, under- or over-segmentation (red boxes). The confidence map shows the probability that a voxel belongs to the corresponding instance, providing additional assurance for the map's application in downstream tasks.
  • Figure 2: The framework of OpenVox consists of two main modules: Instance Segmentation & Understanding and Probabilistic Voxel Reconstruction. In the front-end, captions are encoded by LLMs to improve instance understanding. In the back-end, probabilistic modeling ensures the robustness of incremental instance-level mapping. The voxels in the final map are colored based on the instances with the highest probability.
  • Figure 3: A 2D illustration of incremental instance mapping for OpenVox and ConceptGraphs. Probabilistic modeling allows OpenVox to achieve more robust instance association and fusion, while ConceptGraphs is prone to failure in such cases. These failures compound into subsequent errors in a continuous incremental setting. Note that at time 11 we only show the correlation calculation for the upper half of the region.
  • Figure 4: 3D zero-shot instance segmentation results. The instance colors are randomly assigned and serve solely for differentiation purposes. The probabilistic voxel representation enables OpenVox to accurately segment different instances.
  • Figure 5: 3D zero-shot semantic segmentation results. Comprehensive understanding and weighted updating of instance features enable OpenVox to achieve clear boundaries and accurate semantics.
  • ...and 2 more figures