OpenVox: Real-time Instance-level Open-vocabulary Probabilistic Voxel Representation
Yinan Deng, Bicheng Yao, Yihang Tang, Yi Yang, Yufeng Yue
TL;DR
OpenVox tackles real-time open-vocabulary 3D mapping for robots by combining a front-end that strengthens instance-level language reasoning by encoding captions with a back-end that maintains a probabilistic voxel representation. By decomposing cross-frame fusion into instance association and live map evolution, it achieves robust incremental updates without relying on offline processing. The framework demonstrates state-of-the-art performance in 3D zero-shot instance segmentation, 3D zero-shot semantic segmentation, and open-vocabulary retrieval, and is validated in both synthetic and real-world onboard experiments. This work provides a practical pathway for open-world robotic perception, enabling reliable instance-level semantics in dynamic environments.
Abstract
In recent years, vision-language models (VLMs) have advanced open-vocabulary mapping, enabling mobile robots to simultaneously achieve environmental reconstruction and high-level semantic understanding. While integrated object cognition helps mitigate semantic ambiguity in point-wise feature maps, efficiently obtaining rich semantic understanding and robust incremental reconstruction at the instance level remains challenging. To address these challenges, we introduce OpenVox, a real-time incremental open-vocabulary probabilistic instance voxel representation. In the front-end, we design an efficient instance segmentation and comprehension pipeline that enhances language reasoning by encoding captions. In the back-end, we implement probabilistic instance voxels and decompose the cross-frame incremental fusion process into two subtasks, instance association and live map evolution, ensuring robustness to sensor and segmentation noise. Extensive evaluations across multiple datasets demonstrate that OpenVox achieves state-of-the-art performance in zero-shot instance segmentation, semantic segmentation, and open-vocabulary retrieval. Furthermore, real-world robotics experiments validate OpenVox's capability for stable, real-time operation.
