Open-Vocabulary Online Semantic Mapping for SLAM

Tomas Berriel Martins, Martin R. Oswald, Javier Civera

TL;DR

The paper tackles open-vocabulary 3D semantic SLAM by introducing OVO, an online pipeline that builds a 3D semantic map of segments labeled with CLIP descriptors. It combines SAM-based 2D masks, a 3D segment mapper, and a novel per-dimension CLIP merging network to fuse descriptors across views, enabling loop-closure-aware, open-set semantic labeling within SLAM backbones. Empirically, OVO achieves superior 3D segmentation metrics on Replica and ScanNetv2 compared to offline and online baselines, while maintaining favorable runtime and memory footprints, even on real-time backbones like ORB-SLAM2. The work demonstrates that learned CLIP merging and open-vocabulary descriptors can generalize to unseen classes, broadening the applicability of semantic SLAM to diverse environments and languages. Overall, OVO bridges online SLAM with open-vocabulary vision-language representations to enable robust, scalable 3D semantic mapping for robotics and AR/VR tasks.

Abstract

This paper presents an Open-Vocabulary Online 3D semantic mapping pipeline, which we denote by its acronym, OVO. Given a sequence of posed RGB-D frames, we detect and track 3D segments, which we describe using CLIP vectors. A novel CLIP merging method computes these vectors from the viewpoints in which each segment is observed. Notably, OVO has a significantly lower computational and memory footprint than offline baselines, while also achieving better segmentation metrics than both offline and online ones. Along with superior segmentation performance, we show experimental results of our mapping contributions integrated with two different full SLAM backbones (Gaussian-SLAM and ORB-SLAM2). We are the first to use a neural network to merge CLIP descriptors and to demonstrate end-to-end open-vocabulary online 3D mapping with loop closure.
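
To make the idea of merging CLIP descriptors per dimension concrete, the following is a minimal sketch and not the authors' implementation. The text above states only that a neural network merges CLIP descriptors; here the network is replaced by hand-set logits, and the softmax weighting, the final L2 normalization, and the function names (merge_per_dimension, softmax, l2_normalize) are all assumptions made for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def merge_per_dimension(descriptors, logits):
    """Fuse N descriptors of size D using one weight per descriptor per dimension.

    descriptors: N lists of D floats (stand-ins for CLIP vectors).
    logits:      N lists of D floats. In a learned system these would be
                 predicted by a network; here they are supplied by hand
                 (assumption for illustration).
    """
    count = len(descriptors)
    dim = len(descriptors[0])
    merged = []
    for d in range(dim):
        # Weights for dimension d sum to 1 across the N descriptors.
        weights = softmax([logits[i][d] for i in range(count)])
        merged.append(sum(weights[i] * descriptors[i][d] for i in range(count)))
    return l2_normalize(merged)


# Toy example: two 2-D "descriptors" with equal logits reduce to a normalized mean.
views = [[1.0, 0.0], [0.0, 1.0]]
uniform = [[0.0, 0.0], [0.0, 0.0]]
fused = merge_per_dimension(views, uniform)
```

With equal logits the result is the normalized average of the inputs; unequal logits let individual dimensions lean on different descriptors, which a single scalar weight per descriptor cannot express.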

Paper Structure

This paper contains 27 sections, 5 equations, 8 figures, 14 tables, 1 algorithm.

Figures (8)

  • Figure 1: OVO mapping. Given a set of RGB-D keyframes (top), our method successively reconstructs a 3D open-vocabulary representation of a scene over time (middle). At any moment, both semantic labels (bottom left) and instance labels (bottom right) can be effectively recovered.
  • Figure 2: Overview. From a stream of RGB-D keyframes, OVO builds, online, a 3D semantic representation of the scene. It relies on a 3D segment mapper to cluster 3D points into 3D segments, a queue to distribute the CLIP extraction computation, and a novel CLIP merging method to aggregate CLIP descriptors from multiple keyframes into one for each 3D segment.
  • Figure 3: Out-of-distribution queries. From left to right, top to bottom: common-language queries differentiate bins based on a recycling symbol; recognize sofas and chairs as places to sit; and identify that one can take a nap on a sofa, that pillows and couches are soft objects, that books are readable, that the clock tells the hour, that the blackboard is for drawing equations, and that the jacket is something to stay warm. The colorbar shows similarity strength.
  • Figure 4: 3D semantic segmentation on Replica. OVO yields more accurate results in comparison to the two best offline baselines.
  • Figure 5: Visualization of OVO-ORB-SLAM2 loop closure on "scene0011_00" (ScanNet). We highlight four instances split due to tracking drift and effectively merged after loop-closure by our semantic fusion.
  • ...and 3 more figures
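
The similarity-based querying illustrated in Figure 3 can be sketched as follows. This is a hedged illustration and not the paper's code: the short toy vectors stand in for real CLIP image and text embeddings, and the names cosine and label_segments are hypothetical. It assumes each 3D segment stores one descriptor and that a query is answered by cosine similarity against text embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def label_segments(segment_descriptors, text_embeddings):
    """Assign each segment the text query with the highest cosine similarity.

    segment_descriptors: dict of segment id -> vector (stand-in for a CLIP descriptor).
    text_embeddings:     dict of query string -> vector (stand-in for a CLIP text embedding).
    Returns a dict of segment id -> (best query, similarity score).
    """
    labels = {}
    for seg_id, descriptor in segment_descriptors.items():
        scores = {name: cosine(descriptor, emb) for name, emb in text_embeddings.items()}
        best = max(scores, key=scores.get)
        labels[seg_id] = (best, scores[best])
    return labels


# Toy data: 3-D vectors in place of real embeddings (assumption for illustration).
segments = {1: [1.0, 0.0, 0.0], 2: [0.0, 1.0, 0.1]}
queries = {"a place to sit": [0.9, 0.1, 0.0], "something readable": [0.0, 1.0, 0.0]}
labels = label_segments(segments, queries)
```

Because the label set is just a dictionary of text embeddings, it can be changed at query time without rebuilding the map, which is the practical meaning of "open-vocabulary" in this setting. The per-segment similarity score plays the role of the colorbar value in Figure 3.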