FindAnything: Open-Vocabulary and Object-Centric Mapping for Robot Exploration in Any Environment

Sebastián Barbas Laina, Simon Boche, Sotiris Papatheodorou, Simon Schaefer, Jaehyung Jung, Stefan Leutenegger

TL;DR

FindAnything presents a real-time open-world mapping framework that integrates vision-language features into deformable, object-centric volumetric submaps to enable open-vocabulary semantic understanding and language-guided exploration. By fusing CLIP-based embeddings with eSAM segments and tightly coupling with VI-SLAM, the system preserves geometry while supporting open-set semantics and scalable memory usage on resource-constrained hardware. The approach demonstrates state-of-the-art semantic accuracy on Replica, enables exploration driven by natural language queries, and shows successful onboard deployment on a MAV. Key contributions include online large-scale SLAM with drift-corrected submaps, long-term object-level semantic aggregation, and language-guided exploration that adjusts sampling and utilities toward user-specified concepts. The work advances practical open-world robotics by delivering a memory-efficient, online, open-vocabulary 3D mapping and exploration framework suitable for real robots.

Abstract

Geometrically accurate and semantically expressive map representations have proven invaluable for robust and safe mobile robot navigation and task planning. Nevertheless, real-time, open-vocabulary semantic understanding of large-scale unknown environments is still an open problem. In this paper, we present FindAnything, an open-world mapping and exploration framework that incorporates vision-language information into dense volumetric submaps. By using vision-language features, FindAnything bridges the gap between purely geometric and open-vocabulary semantic information for a higher level of understanding, while allowing exploration of any environment without any external source of ground-truth pose information. We represent the environment as a series of volumetric occupancy submaps, resulting in a robust and accurate map representation that deforms upon pose updates when the underlying SLAM system corrects its drift, allowing for a locally consistent representation between submaps. Pixel-wise vision-language features are aggregated from efficient SAM (eSAM)-generated segments, which are in turn integrated into object-centric volumetric submaps, providing a mapping from open-vocabulary queries to 3D geometry that also scales in terms of memory usage. The open-vocabulary map representation of FindAnything achieves state-of-the-art semantic accuracy in closed-set evaluations on the Replica dataset. This level of scene understanding allows a robot to explore environments based on objects or areas of interest selected via natural language queries. Our system is the first of its kind to be deployed on resource-constrained devices, such as MAVs, leveraging vision-language information for real-world robotic tasks.

Paper Structure

This paper contains 19 sections, 3 equations, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 1: Real-world demonstration of FindAnything on a resource-constrained MAV platform. Top: 3D reconstruction extracted from volumetric maps. Next to the colored meshes, we also display 3D CLIP activations in the object-centric map for the given language query "sofa" and the trajectory of the MAV during an exploration mission. Bottom: different modules of the system - Visual-Inertial SLAM, Depth Images, Segmentation Masks from eSAM (Xiong et al., 2024), RGB Image and CLIP (Radford et al., 2021) activations for the text query.
  • Figure 2: Overview of the proposed FindAnything framework. IMU measurements and stereo images are fed into a VI-SLAM system. Estimated poses are used to integrate depth images into volumetric occupancy submaps. eSAM (Xiong et al., 2024) is run on the RGB images to create object proposals which are tracked against the current map. After segment tracking, CLIP (Radford et al., 2021) features are aggregated per object mask and fused into the current submap. Frontiers of the submaps are incrementally updated. A natural language user input is translated into a CLIP feature to guide the exploration towards areas of interest. An MPC is applied to execute planned paths on a MAV platform.
  • Figure 3: Closed-set semantic accuracy evaluation on the office3 Replica sequence. Left: Ground-truth labeled point cloud for considered classes. Right: Prediction from our mapping framework.
  • Figure 4: Mesh completeness (left column) and reconstruction error (right column) results of the 10th-90th percentile for FindAnything (blue plots) and baseline (orange plots). The top row shows the results for the object of interest "bed" and the bottom row those for "bathroom".
  • Figure 5: Example of the reconstruction obtained by FindAnything in the 00848-ziup5kvtCCR scene from the Habitat Matterport 3D dataset. The top region shows the colored mesh and the bottom region the cosine similarity of the different mapped objects against the query "bed". The light blue regions represent map objects with a high cosine similarity between $\bar{\mathbf{f}}_{\theta}$ and $\mathbf{f}_{\mathrm{q}}$ while dark blue regions have a low similarity. The orange line is the trajectory followed by the MAV while performing the exploration mission.
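
The Figure 5 caption scores each mapped object by the cosine similarity between its aggregated feature and the feature of the text query. The sketch below illustrates that scoring step only. It is not the authors' implementation: the toy 3-dimensional vectors stand in for real CLIP embeddings, the function names (`mean_feature`, `rank_objects`) are hypothetical, and averaging normalized observations is an assumed aggregation rule that the text above does not specify.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def mean_feature(observations):
    """Average unit-normalized observations into one per-object feature.

    ASSUMPTION: a plain mean is used here for illustration only.
    """
    unit = [normalize(o) for o in observations]
    dim = len(unit[0])
    return [sum(u[i] for u in unit) / len(unit) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, in [-1, 1]."""
    ua, ub = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(ua, ub))


def rank_objects(object_features, query_feature):
    """Return (object_id, similarity) pairs, most similar first."""
    scores = {
        oid: cosine_similarity(feat, query_feature)
        for oid, feat in object_features.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data: hypothetical object ids and 3-D stand-ins for CLIP features.
objects = {
    "obj_a": mean_feature([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]),
    "obj_b": mean_feature([[0.0, 1.0, 0.0]]),
}
query = [1.0, 0.0, 0.0]  # stand-in for the encoded text query
ranking = rank_objects(objects, query)
```

In this toy setup `obj_a` ranks first because its averaged feature points almost in the query direction, which corresponds to a light blue region in the figure, while `obj_b` is orthogonal to the query and would be drawn dark blue.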