LOSS-SLAM: Lightweight Open-Set Semantic Simultaneous Localization and Mapping

Kurran Singh, Tim Magoun, John J. Leonard

TL;DR

LOSS-SLAM introduces a lightweight open-set semantic SLAM system that tightly couples sparse object embeddings with a probabilistic factor-graph backend. It leverages DINO-based patch encodings, clustering to latent-space object centroids, and data-association strategies to support open-set object detection and localization. The approach yields more accurate trajectories and more complete semantic maps than dense or closed-set methods while requiring less computational overhead. The method is demonstrated on real collected data and public datasets, with open-sourced code and data.

Abstract

Enabling robots to understand the world in terms of objects is a critical building block towards higher level autonomy. The success of foundation models in vision has created the ability to segment and identify nearly all objects in the world. However, utilizing such objects to localize the robot and build an open-set semantic map of the world remains an open research question. In this work, a system of identifying, localizing, and encoding objects is tightly coupled with probabilistic graphical models for performing open-set semantic simultaneous localization and mapping (SLAM). Results are presented demonstrating that the proposed lightweight object encoding can be used to perform more accurate object-based SLAM than existing open-set methods, closed-set methods, and geometric methods while incurring a lower computational overhead than existing open-set mapping methods.
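
As a rough illustration of what a lightweight object encoding could look like, the sketch below averages the patch embeddings that fall on one detected object and L2-normalizes the result, so that two observations can be compared by cosine similarity. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `object_encoding` and `cosine_similarity`, the use of a plain mean as the centroid, and the toy 2-D vectors are all hypothetical. Real patch embeddings would come from a vision backbone such as DINO and have far more dimensions.

```python
import math


def normalize(vector):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def object_encoding(patch_embeddings):
    """Assumed encoding: the normalized mean (centroid) of an object's patch embeddings."""
    if not patch_embeddings:
        raise ValueError("need at least one patch embedding")
    dim = len(patch_embeddings[0])
    count = len(patch_embeddings)
    centroid = [sum(patch[i] for patch in patch_embeddings) / count for i in range(dim)]
    return normalize(centroid)


def cosine_similarity(a, b):
    """Cosine similarity between two encodings (1.0 means identical direction)."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


# Toy example with made-up 2-D "patch embeddings" for two objects.
box = object_encoding([[0.9, 0.1], [1.0, 0.0], [0.8, 0.2]])
can = object_encoding([[0.1, 0.9], [0.0, 1.0]])
print(cosine_similarity(box, box), cosine_similarity(box, can))
```

Storing one short vector per object, rather than a feature per point or pixel, is the kind of sparsity the abstract refers to when it contrasts the approach with dense open-set mapping.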

Paper Structure (15 sections, 16 equations, 5 figures, 3 tables, 1 algorithm)

Figures (5)

  • Figure 2: The results of running an existing open-set mapping method [murthy2023] on our data. Artifacts such as doubled or tripled versions of objects arise because the method does not explicitly reason about the objects in the scene.
  • Figure 3: Memory usage on sequence 1 of collected data. Sparse methods drastically reduce the memory consumption for open-set SLAM as compared to dense methods.
  • Figure 4: The closed-set map has a less accurate trajectory and also identifies fewer objects in the scene. The geometric-only data association method incorrectly associates objects in close proximity as being the same object, since it does not have the object encoding to help differentiate those instances; the incorrect data associations result in a less accurate map and trajectory. Object color-to-class mappings were identified by a human in post-processing as follows. For open-set: Red - electric socket; Green - sugar box; Dark blue - CheezIt box; Yellow - spam can; Pink - skateboard; White - trash bag; Light blue - trash bin. For closed-set: Red - skateboard; Green - trash can. For geometric-only (note that the classes were not used during mapping, and are identified for comparison purposes only): Red - electric socket; Green - sugar box; Dark blue - skateboard.
  • Figure 5: Images from the collected data were fed to YOLOv8 [Jocher_YOLO_by_Ultralytics_2023], a state-of-the-art and widely used object detector. The detector failed to identify many of the common everyday items in the scene, and even incorrectly labeled one item. Our open-set detector identifies each object and associates it with a latent vector encoding.
  • Figure 6: Our method performs best regardless of the data association method used. Maximum likelihood has the best performance for open-set, closed-set, and geometric-only, and thus for closed-set and geometric-only we only show results for maximum likelihood data association.
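
The captions for Figures 4 and 6 contrast geometric-only data association with association that also uses the object encoding. The following is a minimal sketch of that idea under stated assumptions; it is not the paper's formulation. The function `associate`, the isotropic Gaussian position model, the additive similarity score, and the parameters `sigma`, `max_distance`, and `min_similarity` are all hypothetical choices made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def associate(position, encoding, landmarks, sigma=0.5,
              max_distance=1.5, min_similarity=0.7, use_encoding=True):
    """Pick the highest-scoring landmark, or return None to start a new landmark.

    Each landmark is a (position, encoding) pair. The score is a Gaussian
    log-likelihood on position (up to a constant) plus, when enabled, the
    cosine similarity of the encodings. Both terms are illustrative assumptions.
    """
    best_index, best_score = None, -math.inf
    for index, (lm_position, lm_encoding) in enumerate(landmarks):
        distance = math.dist(position, lm_position)
        if distance > max_distance:
            continue  # geometric gate: too far away to be the same object
        score = -0.5 * (distance / sigma) ** 2
        if use_encoding:
            similarity = cosine_similarity(encoding, lm_encoding)
            if similarity < min_similarity:
                continue  # semantic gate: looks like a different object
            score += similarity
        if score > best_score:
            best_index, best_score = index, score
    return best_index


# Two different objects placed 0.3 m apart, with toy 2-D encodings.
landmarks = [((0.0, 0.0, 0.0), [1.0, 0.0]),
             ((0.3, 0.0, 0.0), [0.0, 1.0])]
measurement = ((0.1, 0.0, 0.0), [0.0, 1.0])  # looks like the second object

print(associate(*measurement, landmarks, use_encoding=False))  # 0 (nearest, wrong object)
print(associate(*measurement, landmarks, use_encoding=True))   # 1 (matches the encoding)
```

In this toy case the geometric-only variant merges the measurement into the nearest landmark even though the encodings disagree, which mirrors the failure mode described in the Figure 4 caption.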