Language-Embedded Gaussian Splats (LEGS): Incrementally Building Room-Scale Representations with a Mobile Robot

Justin Yu, Kush Hari, Kishore Srinivas, Karim El-Refai, Adam Rashid, Chung Min Kim, Justin Kerr, Richard Cheng, Muhammad Zubair Irshad, Ashwin Balakrishna, Thomas Kollar, Ken Goldberg

TL;DR

This work tackles open-vocabulary semantic mapping for mobile robots in large indoor spaces by introducing LEGS, a system that incrementally builds a room-scale 3D semantic map using Language-Embedded Gaussian Splats. LEGS combines online multi-camera reconstruction, incremental 3D Gaussian Splat construction, and a language-grounded, hash-encoded semantic field to support open-vocabulary object localization. In evaluations on 4 room-scale scenes, LEGS trains over 3.5x faster than a LERF baseline while achieving comparable object query success rates, and it localizes open-vocabulary and long-tail object queries with up to 66% accuracy. Results also suggest that a multi-camera setup and incremental bundle adjustment can improve visual reconstruction quality on constrained robot trajectories. The approach is aimed at online robotic perception and object search in large indoor environments, with future work extending to dynamic scenes and autonomous exploration.

Abstract

Building semantic 3D maps is valuable for searching for objects of interest in offices, warehouses, stores, and homes. We present a mapping system that incrementally builds a Language-Embedded Gaussian Splat (LEGS): a detailed 3D scene representation that encodes both appearance and semantics in a unified representation. LEGS is trained online as a robot traverses its environment to enable localization of open-vocabulary object queries. We evaluate LEGS on 4 room-scale scenes where we query for objects in the scene to assess how LEGS can capture semantic meaning. We compare LEGS to LERF and find that while both systems have comparable object query success rates, LEGS trains over 3.5x faster than LERF. Results suggest that a multi-camera setup and incremental bundle adjustment can boost visual reconstruction quality in constrained robot trajectories, and suggest LEGS can localize open-vocabulary and long-tail object queries with up to 66% accuracy.

Paper Structure

This paper contains 18 sections, 7 figures, and 2 tables.

Figures (7)

  • Figure 1: Language-Embedded Gaussian Splat in the TRI Grocery Store Testbed (Bajracharya et al., 2024). LEGS relies entirely on pretrained VLMs and does not require any inventory data or finetuning.
  • Figure 2: LEGS System Integration. For LEGS, we use a Fetch robot with a custom multi-camera configuration in which a RealSense D455 faces forward and 2 ZED cameras face the left and right sides, respectively. The left ZED image stream is fed into DROID-SLAM to compute pose estimates for the left camera, and the corresponding extrinsics are used to compute the pose estimates for the other ZED camera and the D455. These image-pose pairs are then used for concurrent online Gaussian splat and CLIP training. From there, the Gaussian splat can be queried for an object (e.g., "First Aid Kit"), and the corresponding relevancy field is computed to localize the desired object.
  • Figure 3: 4 Scene Environments.
  • Figure 4: Single-Camera Reconstruction Comparison Results. We compare the quality of Gaussian splats captured with an Intel RealSense D435, an Intel RealSense D455, and a Stereolabs ZED 2, with and without bundle adjustment. For each configuration we present two views of the Gaussian splat: one facing the kitchen island head-on and another at an angle.
  • Figure 5: Successful query localization results. Coordinate frames mark the localized open-vocabulary and long-tail objects: (a) "garfield" and (b) "hearing protection."
  • ...and 2 more figures
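
The Figure 2 caption notes that only the left ZED stream is tracked by DROID-SLAM, and that poses for the other cameras follow from the rig extrinsics. A minimal sketch of that pose propagation is given below, assuming 4x4 homogeneous camera-to-world transforms and a rigid camera rig. The helper names (`make_pose`, `propagate_pose`) and all numeric values are hypothetical placeholders chosen for illustration; they are not the paper's calibration, coordinate conventions, or code.

```python
import math

def matmul(a, b):
    """Multiply two 4x4 homogeneous transforms given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def make_pose(yaw_deg, translation):
    """Build a 4x4 transform from a yaw about the z-axis and a translation.

    Restricting rotation to yaw keeps the example short; a real rig would
    use full 3D rotations from calibration.
    """
    c = math.cos(math.radians(yaw_deg))
    s = math.sin(math.radians(yaw_deg))
    tx, ty, tz = translation
    return [[c, -s, 0.0, tx],
            [s,  c, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]

def propagate_pose(world_T_left, left_T_other):
    """Compose the tracked left-camera pose with a fixed extrinsic.

    world_T_other = world_T_left * left_T_other
    """
    return matmul(world_T_left, left_T_other)

# Hypothetical SLAM estimate for the left camera at one time step.
world_T_left = make_pose(90.0, (1.0, 2.0, 0.0))

# Hypothetical fixed extrinsics expressed in the left camera's frame
# (placeholder values, not a real calibration).
left_T_right = make_pose(180.0, (0.0, -0.3, 0.0))
left_T_front = make_pose(-90.0, (0.1, -0.15, 0.0))

world_T_right = propagate_pose(world_T_left, left_T_right)
world_T_front = propagate_pose(world_T_left, left_T_front)

for name, pose in (("right", world_T_right), ("front", world_T_front)):
    position = [round(row[3], 3) for row in pose[:3]]
    print(name, "camera position in world frame:", position)
```

Because the extrinsics are fixed, one tracked camera is enough to pose every image from the rig at each time step, which is what allows all three streams to contribute image-pose pairs to training.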