Ov3R: Open-Vocabulary Semantic 3D Reconstruction from RGB Videos
Ziren Gong, Xiaohan Li, Fabio Tosi, Jiawei Han, Stefano Mattoccia, Jianfei Cai, Matteo Poggi
TL;DR
Ov3R tackles open-vocabulary semantic 3D reconstruction from RGB video by coupling a CLIP-informed 3R reconstruction module (CLIP3R) with a 2D-3D open-vocabulary segmentation module (2D-3D OVS). CLIP3R injects object-level CLIP semantics into dense 3D point maps, while 2D-3D OVS lifts 2D features into fused descriptors that align with text embeddings for open-set labeling. Across Replica, 7Scenes, and ScanNetv2, Ov3R delivers state-of-the-art reconstruction fidelity and competitive, fine-grained, semantically consistent segmentation with real-time performance, without relying on a predefined vocabulary. These results mark a step toward RGB-only, real-time, semantics-rich Spatial AI systems.
Abstract
We present Ov3R, a novel framework for open-vocabulary semantic 3D reconstruction from RGB video streams, designed to advance Spatial AI. The system features two key components: CLIP3R, a CLIP-informed 3D reconstruction module that predicts dense point maps from overlapping clips while embedding object-level semantics; and 2D-3D OVS, a 2D-3D open-vocabulary semantic module that lifts 2D features into 3D by learning fused descriptors integrating spatial, geometric, and semantic cues. Unlike prior methods, Ov3R incorporates CLIP semantics directly into the reconstruction process, enabling globally consistent geometry and fine-grained semantic alignment. Our framework achieves state-of-the-art performance in both dense 3D reconstruction and open-vocabulary 3D segmentation, marking a step forward toward real-time, semantics-aware Spatial AI.
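To make the open-vocabulary labeling idea concrete, the sketch below shows the general pattern of matching per-point descriptors against text embeddings by cosine similarity. It is an illustration only and not the authors' implementation: the function name `label_points`, the 4-dimensional toy vectors, and the three labels are all assumptions standing in for real fused descriptors and CLIP text embeddings.

```python
import numpy as np

def label_points(point_desc, text_emb, labels):
    """Assign each 3D point the label whose text embedding is most similar.

    point_desc: (N, D) array of per-point descriptors (hypothetical stand-in
                for fused 2D-3D descriptors).
    text_emb:   (K, D) array of text embeddings, one per candidate label
                (hypothetical stand-in for CLIP text features).
    labels:     list of K label strings.
    """
    # L2-normalize both sides so the dot product equals cosine similarity.
    p = point_desc / np.linalg.norm(point_desc, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sim = p @ t.T                      # (N, K) cosine similarities
    best = sim.argmax(axis=1)          # index of the closest label per point
    return [labels[i] for i in best], sim

# Toy data: three labels with axis-aligned embeddings, three points.
labels = ["chair", "table", "sofa"]
text_emb = np.eye(3, 4)
point_desc = np.array([
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.2, 0.8, 0.1],
    [0.1, 2.0, 0.3, 0.0],
])

pred, sim = label_points(point_desc, text_emb, labels)
print(pred)
```

Because the label set is passed in at query time, it can be changed without retraining anything, which is what makes this style of labeling open-vocabulary.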
