OVI-MAP: Open-Vocabulary Instance-Semantic Mapping

Zilong Deng, Federico Tombari, Marc Pollefeys, Johanna Wald, Daniel Barath

Abstract

Incremental open-vocabulary 3D instance-semantic mapping is essential for autonomous agents operating in complex everyday environments. However, it remains challenging due to the need for robust instance segmentation, real-time processing, and flexible open-set reasoning. Existing methods often rely on a closed-set assumption or dense per-pixel language fusion, which limits scalability and temporal consistency. We introduce OVI-MAP, which decouples instance reconstruction from semantic inference. A class-agnostic 3D instance map is incrementally constructed from RGB-D input, while semantic features are extracted only from a small set of automatically selected views using vision-language models. This design enables stable instance tracking and zero-shot semantic labeling throughout online exploration. Our system operates in real time and outperforms state-of-the-art open-vocabulary mapping baselines on standard benchmarks.

Paper Structure

This paper contains 18 sections, 13 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Overview of OVI-MAP. Given a streaming RGB-D sequence with camera poses, OVI-MAP incrementally reconstructs a volumetric 3D scene while maintaining a class-agnostic instance map. Semantic information is then assigned in a zero-shot manner using selectively chosen views, enabling open-set object recognition. Our method supports real-time, open-world scene reconstruction with instance-level semantic understanding.
  • Figure 2: Overview of the proposed pipeline. (a) Class-agnostic instance map reconstruction: each RGB-D frame is segmented into entity proposals and refined with geometry-aware depth segmentation. The fused segments are lifted into 3D and incrementally integrated into a global TSDF-based instance map via super-point registration and spatial voting. (b) Incremental semantic feature aggregation: given the global instance map, each instance is re-projected into new frames via depth-guided ray casting. A view selection module identifies informative viewpoints based on object-centric coverage. Selected views are cropped at multiple scales and masked before being passed to a Vision-Language Model (VLM). The resulting features are aggregated per instance, producing stable open-set semantic embeddings.
  • Figure 3: Comparison of view selection strategies. The left illustration shows the pixel-counting strategy [takmaz2023openmask3d], which prioritizes frames with a larger object mask area, often leading to redundant front-facing views. The right illustration depicts our proposed object-centric view coverage method, which maintains a spherical map of explored viewing directions and selects frames that provide novel perspectives of the object. This yields a more diverse and informative set of viewpoints for semantic feature extraction.
  • Figure 4: Open-vocabulary 3D semantic maps aligned to the ground-truth label sets of the respective datasets. We compare our method with online [martins2024ovoslam] and offline [engelmann2024opennerf, guo2024semantic, peng2023openscene, huang2024segment3d, takmaz2023openmask3d] approaches on the Replica (a) and ScanNet (b) datasets. Colors correspond to the semantic classes defined in each dataset. Gray regions indicate unobserved areas. OVI-MAP produces spatially coherent and semantically accurate reconstructions, maintaining sharp instance boundaries and consistent semantics throughout incremental mapping. Minor discrepancies in color (e.g., pillows in (a) and the table in (b)) arise from closed-set label mapping, where the ground truth uses alternative class names such as “cushion” or “dining table”.
  • Figure 5: Instance highlighting from arbitrary text queries. Given natural language prompts, our system retrieves and highlights corresponding 3D instances based on the learned vision-language embeddings. The examples demonstrate zero-shot grounding of both concrete ("pillow", "toilet") and abstract ("where to sleep", "where is the music") concepts in reconstructed scenes. Darker tones indicate higher cosine similarity between an object and the query.
  • ...and 3 more figures
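The object-centric view coverage idea in Figure 3 can be sketched as a greedy loop over candidate frames: keep a frame only if its viewing direction (from the object centroid to the camera) is sufficiently different from directions already covered. The following is a minimal, illustrative sketch, not the paper's implementation; the function name, threshold, and input format are assumptions.

```python
import numpy as np

def select_informative_views(view_dirs, novelty_thresh=0.9):
    """Greedy object-centric view selection (illustrative sketch).

    view_dirs: (N, 3) array of vectors from the object centroid to each
    candidate camera. A frame is kept only if its viewing direction is
    novel, i.e., its cosine similarity to every already-covered
    direction stays below `novelty_thresh`.
    """
    selected, covered = [], []
    for i, d in enumerate(view_dirs):
        d = d / np.linalg.norm(d)
        # Reject near-duplicate viewpoints of directions we already have.
        if all(np.dot(d, c) < novelty_thresh for c in covered):
            selected.append(i)
            covered.append(d)
    return selected
```

In contrast to pixel counting, which would repeatedly pick the largest (typically front-facing) masks, this sketch spreads the selected frames over the viewing sphere.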
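The text-query highlighting in Figure 5 amounts to ranking per-instance embeddings by cosine similarity against the embedded query. A minimal sketch, assuming the query and instance embeddings already come from a shared vision-language embedding space (the function and variable names are illustrative):

```python
import numpy as np

def rank_instances(query_emb, instance_embs):
    """Rank 3D instances by cosine similarity to a text-query embedding.

    query_emb: (D,) embedding of the text query (e.g., from a VLM text
    encoder). instance_embs: (M, D) aggregated per-instance embeddings.
    Returns (indices sorted by descending similarity, similarity scores).
    """
    q = query_emb / np.linalg.norm(query_emb)
    X = instance_embs / np.linalg.norm(instance_embs, axis=1, keepdims=True)
    sims = X @ q                      # cosine similarity per instance
    return np.argsort(-sims), sims
```

The similarity scores can then be mapped to a color ramp so that darker tones mark instances closer to the query, matching the visualization described in the caption.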