Ov3R: Open-Vocabulary Semantic 3D Reconstruction from RGB Videos

Ziren Gong, Xiaohan Li, Fabio Tosi, Jiawei Han, Stefano Mattoccia, Jianfei Cai, Matteo Poggi

TL;DR

Ov3R performs open-vocabulary semantic 3D reconstruction from RGB video by coupling a CLIP-informed 3D reconstruction module (CLIP3R) with a 2D-3D open-vocabulary segmentation module (2D-3D OVS). CLIP3R injects object-level CLIP semantics into dense 3D pointmaps, while 2D-3D OVS lifts 2D features into fused descriptors that are aligned with text embeddings for open-set labeling. Across Replica, 7Scenes, and ScanNetv2, Ov3R delivers state-of-the-art reconstruction fidelity and competitive, semantics-aware segmentation with real-time performance. It achieves strong semantic consistency and fine-grained segmentation without predefined vocabularies, marking a step toward real-time, RGB-only, semantics-rich Spatial AI systems.

Abstract

We present Ov3R, a novel framework for open-vocabulary semantic 3D reconstruction from RGB video streams, designed to advance Spatial AI. The system features two key components: CLIP3R, a CLIP-informed 3D reconstruction module that predicts dense point maps from overlapping clips while embedding object-level semantics; and 2D-3D OVS, a 2D-3D open-vocabulary semantic module that lifts 2D features into 3D by learning fused descriptors integrating spatial, geometric, and semantic cues. Unlike prior methods, Ov3R incorporates CLIP semantics directly into the reconstruction process, enabling globally consistent geometry and fine-grained semantic alignment. Our framework achieves state-of-the-art performance in both dense 3D reconstruction and open-vocabulary 3D segmentation, marking a step forward toward real-time, semantics-aware Spatial AI.

Paper Structure

This paper contains 13 sections, 13 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Ov3R is an Open-Vocabulary Semantic 3D Reconstruction Framework. It consists of two novel feed-forward modules, CLIP3R and 2D-3D OVS, and excels in both 3D reconstruction and open-vocabulary 3D semantic segmentation.
  • Figure 2: Overview of Ov3R. Given RGB-only videos, we first apply CLIP3R to produce scene points while SAM predicts 2D segments. Each 2D segment is matched to its corresponding 3D points to obtain 3D semantics. Next, the 2D-3D OVS extracts the fused 2D-3D descriptor to compute the cosine similarity with the text embeddings corresponding to a set of semantic classes.
  • Figure 3: CLIP3R Overview. I2P integrates object-level CLIP features with visual embeddings to predict local pointmaps. L2W then aligns these local pointmaps to global scene coordinates while predicting object-level CLIP3R features with a DPT head in scene space.
  • Figure 4: 2D-3D OVS Overview. After matching 2D and 3D segments across images and pointmaps, CLIP3R, DINO, and 3D-CLIP features are combined into a 2D-3D fused descriptor, on top of which open-vocabulary semantic segmentation is performed. CLIP3R and DINO features are processed both at scene and instance levels. Meanwhile, 3D-CLIP features are extracted from masked 3D object points.
  • Figure 5: Qualitative results: Dense Pointmaps on Replica. Compared to competing methods, Ov3R demonstrates superior completeness and geometric alignment, particularly visible in the reconstruction of chairs (blue) and desks (yellow).
  • ...and 1 more figure
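
The labeling step described in the Figure 2 caption (matching each 2D segment to its 3D points, then comparing a fused descriptor with text embeddings by cosine similarity) can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names `lift_segment_to_points` and `label_segments` are hypothetical, the pointmap is assumed to be pixel-aligned with shape (H, W, 3), and the descriptors and text embeddings are assumed to share the same dimensionality.

```python
import numpy as np


def lift_segment_to_points(pointmap, mask):
    """Select the 3D points that fall under a 2D segment mask.

    Assumes `pointmap` is pixel-aligned with the image: shape (H, W, 3),
    one 3D point per pixel. `mask` is a boolean array of shape (H, W).
    Returns an (N, 3) array of the masked points.
    """
    return pointmap[mask]


def label_segments(descriptors, text_embeddings, class_names, eps=1e-8):
    """Assign each segment the class whose text embedding is most similar.

    descriptors:     (S, D) array, one descriptor per segment (assumed given).
    text_embeddings: (C, D) array, one embedding per class name (assumed given).
    Returns the list of predicted class names and the (S, C) similarity matrix.
    """
    # L2-normalise both sides so the dot product equals cosine similarity.
    d = descriptors / (np.linalg.norm(descriptors, axis=1, keepdims=True) + eps)
    t = text_embeddings / (np.linalg.norm(text_embeddings, axis=1, keepdims=True) + eps)
    similarity = d @ t.T
    best = similarity.argmax(axis=1)
    return [class_names[i] for i in best], similarity


if __name__ == "__main__":
    # Toy data only: a 2x2 pointmap and hand-written 2D "embeddings".
    pointmap = np.zeros((2, 2, 3))
    pointmap[0, 1] = [1.0, 2.0, 3.0]
    mask = np.array([[False, True], [False, False]])
    print(lift_segment_to_points(pointmap, mask))

    text = np.array([[1.0, 0.0], [0.0, 1.0]])
    desc = np.array([[0.9, 0.1], [0.2, 2.0]])
    labels, _ = label_segments(desc, text, ["chair", "desk"])
    print(labels)
```

Because the class list is only a set of text embeddings, changing the vocabulary in this sketch means swapping `text_embeddings` and `class_names`; nothing about the descriptors has to be recomputed.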