KM-ViPE: Online Tightly Coupled Vision-Language-Geometry Fusion for Open-Vocabulary Semantic SLAM

Zaid Nasser, Mikhail Iumanov, Tianhao Li, Maxim Popov, Jaafar Mahmoud, Malik Mohrat, Ilya Obrubov, Ekaterina Derevyanka, Ivan Sosin, Sergey Kolyubin

TL;DR

KM-ViPE addresses online open-vocabulary SLAM with uncalibrated monocular cameras in dynamic ego-centric environments by fusing dense DINO embeddings with geometric constraints in a tightly coupled optimization. It introduces a dense bundle adjustment with an embedding similarity term and an adaptive robust kernel to handle moving and movable objects. The system leverages internet-scale training data to provide open-vocabulary semantics via a Talk2DINO mapping, enabling language-grounded queries of the 3D map in real time. The results show competitive SLAM accuracy in dynamic scenes and enable 3D semantic mapping without depth sensors or prior camera calibration, supporting robotics and AR/VR applications.

Abstract

We present KM-ViPE (Knowledge Mapping Video Pose Engine), a real-time open-vocabulary SLAM framework for uncalibrated monocular cameras in dynamic environments. Unlike systems that require depth sensors and offline calibration, KM-ViPE operates directly on raw RGB streams, making it well suited to ego-centric applications and to harvesting internet-scale video data for training. KM-ViPE tightly couples DINO visual features with geometric constraints through an adaptive robust kernel based on high-level features, which handles both moving objects and movable static objects (e.g., furniture being moved in ego-centric views). The system performs simultaneous online localization and open-vocabulary semantic mapping by fusing geometric features with deep visual features aligned with language embeddings. Our results are competitive with state-of-the-art approaches, whereas existing solutions either operate offline, need depth data and/or odometry estimation, or lack robustness to dynamic scenes. KM-ViPE benefits from internet-scale training and uniquely combines online operation, uncalibrated monocular input, and robust handling of dynamic scenes, which makes it a good fit for autonomous robotics and AR/VR applications and advances practical spatial intelligence capabilities for embodied AI.
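
To make the idea of a language-grounded map query concrete, the following is a minimal sketch, not the authors' implementation. It assumes each 3D map point already carries a fused embedding and that the text query has already been mapped into the same embedding space (this section does not detail how the Talk2DINO mapping does that). The names `cosine`, `query_map`, and the `threshold` parameter are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def query_map(point_embeddings, text_embedding, threshold=0.5):
    """Return indices of map points whose embedding matches the text query.

    Assumption: point and text embeddings live in one shared space.
    The threshold is an illustrative value, not taken from the paper.
    """
    return [
        i
        for i, emb in enumerate(point_embeddings)
        if cosine(emb, text_embedding) >= threshold
    ]


# Toy 3-dimensional embeddings standing in for real high-dimensional features.
points = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
query = [1.0, 0.0, 0.0]
matches = query_map(points, query)
```

With these toy values, `matches` contains the first and third points, since only their embeddings point in roughly the same direction as the query.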

Paper Structure

This paper contains 24 sections, 24 equations, 4 figures, 5 tables, 1 algorithm.

Figures (4)

  • Figure 1: KM-ViPE: a real-time open-vocabulary SLAM framework for uncalibrated monocular RGB cameras in dynamic environments for ego-centric applications.
  • Figure 2: System pipeline
  • Figure 3: Adaptive robust kernels based on Barron's function in bundle adjustment. The shape parameter $\alpha$ is determined by the cosine similarity between multiview high-level visual features (DINOv2).
  • Figure 4: Replica 3D point cloud with aligned, fused high-level embeddings, visualized using PCA colorization.
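
The adaptive kernel named in the Figure 3 caption can be illustrated with Barron's general robust loss, $\rho(x, \alpha, c) = \frac{|\alpha - 2|}{\alpha}\left[\left(\frac{(x/c)^2}{|\alpha - 2|} + 1\right)^{\alpha/2} - 1\right]$, whose shape parameter $\alpha$ interpolates between a quadratic loss ($\alpha = 2$) and increasingly outlier-tolerant losses as $\alpha$ decreases. The sketch below is an assumption-laden illustration, not the paper's formulation: the linear mapping from cosine similarity to $\alpha$, its range, and its direction (high similarity gives $\alpha$ near 2) are hypothetical, as are the function names.

```python
import math


def barron_loss(x, alpha, c=1.0):
    """Barron's general robust loss for a scalar residual x with scale c."""
    z = (x / c) ** 2
    if alpha == 2.0:  # limit case: quadratic (L2) loss
        return 0.5 * z
    if alpha == 0.0:  # limit case: Cauchy / Lorentzian loss
        return math.log1p(0.5 * z)
    if alpha == -math.inf:  # limit case: Welsch loss
        return 1.0 - math.exp(-0.5 * z)
    b = abs(alpha - 2.0)
    return (b / alpha) * ((z / b + 1.0) ** (alpha / 2.0) - 1.0)


def alpha_from_similarity(cos_sim, alpha_min=-2.0, alpha_max=2.0):
    """Hypothetical mapping from feature cosine similarity to the shape alpha.

    Assumption: similar multiview features (likely static) get alpha near 2,
    dissimilar ones (likely moving or moved) get a smaller, more robust alpha.
    """
    s = min(max(cos_sim, 0.0), 1.0)  # clamp to [0, 1]
    return alpha_min + (alpha_max - alpha_min) * s


residual = 3.0
loss_consistent = barron_loss(residual, alpha_from_similarity(0.9))
loss_inconsistent = barron_loss(residual, alpha_from_similarity(0.2))
```

Under this assumed mapping, the same residual is penalized less when the features disagree across views (`loss_inconsistent < loss_consistent`), so a point on a moving object pulls less on the pose estimate.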