Language-EXtended Indoor SLAM (LEXIS): A Versatile System for Real-time Visual Scene Understanding

Christina Kassab; Matias Mattamala; Lintong Zhang; Maurice Fallon

TL;DR

LEXIS addresses the limitation of fixed-class semantic models in indoor SLAM by integrating open-vocabulary CLIP features into a real-time topological pose graph. The system jointly supports room segmentation, room-aware place recognition, and semantic loop closure using a single pre-trained model, enabling flexible scene understanding without extensive retraining. Experiments on simulated and real multi-floor datasets show improved room segmentation accuracy, competitive place recognition, and SLAM performance comparable to state-of-the-art methods, with a demonstrated planning capability. This work highlights the potential of open-vocabulary language models to enhance automatic interaction with indoor environments and informs future directions toward dense reconstruction and uncertainty-aware long-term operation.

Abstract

Versatile and adaptive semantic understanding would enable autonomous systems to comprehend and interact with their surroundings. Existing fixed-class models limit the adaptability of indoor mobile and assistive autonomous systems. In this work, we introduce LEXIS, a real-time indoor Simultaneous Localization and Mapping (SLAM) system that harnesses the open-vocabulary nature of Large Language Models (LLMs) to create a unified approach to scene understanding and place recognition. The approach first builds a topological SLAM graph of the environment (using visual-inertial odometry) and embeds Contrastive Language-Image Pretraining (CLIP) features in the graph nodes. We use this representation for flexible room classification and segmentation, which in turn serves as the basis for room-centric place recognition. This allows loop closure searches to be directed towards semantically relevant places. The proposed system is evaluated on both public simulated data and real-world data, covering office and home environments. It successfully categorizes rooms with varying layouts and dimensions and outperforms the state-of-the-art (SOTA). For place recognition and trajectory estimation tasks, we achieve performance equivalent to the SOTA, all while using the same pre-trained model. Lastly, we demonstrate the system's potential for planning.
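The two ideas in the abstract, labelling pose graph nodes by comparing their image embeddings against text prompts and then restricting loop closure search to nodes with the same room label, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the 3-D vectors are toy stand-ins for real CLIP embeddings, and the function names and the `min_sim` threshold are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def classify_node(image_emb, prompt_embs):
    """Pick the room label whose text embedding is closest to the node's image embedding."""
    return max(prompt_embs, key=lambda label: cosine(image_emb, prompt_embs[label]))


def loop_candidates(query_id, nodes, labels, min_sim=0.9):
    """Return (node_id, similarity) pairs, best first, drawn only from nodes
    that share the query node's room label. min_sim is a hypothetical threshold."""
    query_emb = nodes[query_id]
    matches = []
    for node_id, emb in nodes.items():
        if node_id == query_id or labels[node_id] != labels[query_id]:
            continue  # skip the query itself and nodes in other rooms
        sim = cosine(query_emb, emb)
        if sim >= min_sim:
            matches.append((node_id, sim))
    return sorted(matches, key=lambda m: m[1], reverse=True)


# Toy stand-ins for text embeddings of the room prompt list (assumption).
prompt_embs = {
    "office": [1.0, 0.0, 0.0],
    "kitchen": [0.0, 1.0, 0.0],
    "corridor": [0.0, 0.0, 1.0],
}

# Toy stand-ins for per-node image embeddings in the pose graph (assumption).
nodes = {
    0: [0.90, 0.10, 0.00],  # first visit to the office
    1: [0.10, 0.90, 0.10],  # kitchen
    2: [0.00, 0.20, 0.90],  # corridor
    3: [0.88, 0.12, 0.02],  # later revisit of the office
}

labels = {node_id: classify_node(emb, prompt_embs) for node_id, emb in nodes.items()}
candidates = loop_candidates(3, nodes, labels)
print(labels)
print([node_id for node_id, _ in candidates])
```

The sketch omits the refinement and clustering stages listed in the paper outline, and it does not exclude temporally adjacent nodes, which a practical loop closure search would normally need to do.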

Paper Structure (17 sections, 2 equations, 7 figures, 3 tables)

Introduction
Related Work
Semantic Scene Representations
LLM-powered Representations
Method
Front-end
Room Estimation and Refinement
Clustering
Semantic Loop Closure Detection
Experiments and Results
Experimental Setup
Results
Room Segmentation and Classification
Semantic Place Recognition
Full System Evaluation
...and 2 more sections

Figures (7)

Figure 1: LEXIS enables pose graph segmentation from natural language. By exploiting the open-vocabulary capabilities of LLMs, we can segment room instances such as office, kitchen, and corridor directly from the pose graph without fine-tuning. The above dataset is from a two-floor office environment and contains 7 rooms as well as 2 corridors and stairs.
Figure 2: LEXIS system overview: The only inputs are RGB images and an odometry estimate from a visual-inertial state estimator, as well as a prompt list of potential room classes. The output is a semantic pose graph that encodes room information.
Figure 3: Room segmentation and refinement on a pose graph with data from the uHumans2 Apartment scene (uH2-Apt). (a) Initial room labels are given by CLIP. (b) The room labels post refinement. (c) Clustering into room instances. (d) Segmentation into floors.
Figure 4: Segmentations produced by LEXIS for the uHumans2 office (uH2-Off) dataset. Also shown are the ground-truth bounding boxes used in Hydra's evaluation. Misclassifications occur during room transitions (examples A and B) or in areas with fewer features (C).
Figure 5: Number of true positives and false positives (red $\blacksquare$) using three different VPR methods (DBoW, NetVLAD, and LEXIS) on the Home (left) and ORI (right) datasets, averaged over 5 runs.
...and 2 more figures
