Lifelong LERF: Local 3D Semantic Inventory Monitoring Using FogROS2

Adam Rashid, Chung Min Kim, Justin Kerr, Letian Fu, Kush Hari, Ayah Ahmad, Kaiyuan Chen, Huang Huang, Marcus Gualtieri, Michael Wang, Christian Juette, Nan Tian, Liu Ren, Ken Goldberg

TL;DR

This work introduces Lifelong LERF, a method that allows a mobile robot with minimal compute to jointly optimize a dense language and geometric representation of its surroundings by detecting semantic changes and selectively updating these regions of the environment, avoiding the need to exhaustively remap.

Abstract

Inventory monitoring in homes, factories, and retail stores relies on maintaining data despite objects being swapped, added, removed, or moved. We introduce Lifelong LERF, a method that allows a mobile robot with minimal compute to jointly optimize a dense language and geometric representation of its surroundings. Lifelong LERF maintains this representation over time by detecting semantic changes and selectively updating these regions of the environment, avoiding the need to exhaustively remap. Human users can query inventory by providing natural language queries and receiving a 3D heatmap of potential object locations. To manage the computational load, we use FogROS2, a cloud robotics platform, to offload resource-intensive tasks. Lifelong LERF obtains poses from a monocular RGBD SLAM backend, and uses these poses to progressively optimize a Language Embedded Radiance Field (LERF) for semantic monitoring. Experiments with 3-5 objects arranged on a tabletop and a TurtleBot with a RealSense camera suggest that Lifelong LERF can persistently adapt to changes in objects with up to 91% accuracy.

Paper Structure (24 sections, 1 equation, 6 figures, 2 tables)

This paper contains 24 sections, 1 equation, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Lifelong LERF Example. Top: A mobile robot takes a scan of the scene and builds a Language Embedded Radiance Field (LERF). Then, the scene is altered, for example the "Clorox" wipes are replaced by a cookie can. The robot periodically rescans the scene and identifies what has changed by rendering semantic features stored in LERF and comparing them against those extracted from the newly captured images. Bottom: Lifelong LERF efficiently updates the LERF using new images of the scene, progressively changing local geometry and semantics.
  • Figure 2: Experiment setup and FogROS2 integration. Left: We use a TurtleBot 4 as the mobile robot. We mount a RealSense D457 RGBD camera on top of the robot via a monopod. Right: We use FogROS2 to execute both DROID-SLAM and LERF on a cloud machine. In particular, after the DROID-SLAM node obtains paired RGB and depth observations, it computes the camera pose. The LERF node reconstructs the LERF and calculates whether the new observation is semantically inconsistent with the stored representation.
  • Figure 3: Semantic differencing. The semantic differencing module calculates 2D feature maps from the fresh observation (top), the 3D LERF embeddings (middle), and the 2D CLIP embeddings of a NeRF-rendered image. $\phi^{rend}$ approximates the distribution shift from 2D to 3D, resulting in higher-quality semantic difference heatmaps (bottom). See Sec. \ref{sec:method-diff} for details.
  • Figure 4: Experiment setup. Left: Test objects. Right: Three types of scene changes included in the evaluation. The red box denotes the changed scene region.
  • Figure 5: Sequential scene update. Scene reconstructions are shown in the middle, and human queries are shown at the bottom. As the scene updates, the heatmap for human queries updates accordingly.
  • ...and 1 more figure
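
The captions for Figures 1 and 3 describe change detection as comparing semantic features rendered from the stored LERF against features extracted from a newly captured image. The following is a minimal sketch of that comparison, not the authors' implementation: it assumes both feature maps are already available as arrays of shape (H, W, D), uses a plain per-pixel cosine distance, and does not model the $\phi^{rend}$ term from Figure 3. The function names, the threshold of 0.5, and the toy data are all hypothetical.

```python
import numpy as np


def cosine_difference_map(rendered_feats, observed_feats, eps=1e-8):
    """Per-pixel cosine distance between two (H, W, D) feature maps.

    Values lie in [0, 2]: near 0 where features agree, larger where they differ.
    """
    r = rendered_feats / (np.linalg.norm(rendered_feats, axis=-1, keepdims=True) + eps)
    o = observed_feats / (np.linalg.norm(observed_feats, axis=-1, keepdims=True) + eps)
    return 1.0 - np.sum(r * o, axis=-1)


def changed_mask(diff_map, threshold=0.5):
    """Boolean mask of pixels whose semantic difference exceeds the threshold.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return diff_map > threshold


# Toy data standing in for real feature maps (random, not CLIP embeddings).
rng = np.random.default_rng(0)
H, W, D = 4, 6, 16
rendered = rng.normal(size=(H, W, D))   # features rendered from the stored map
observed = rendered.copy()              # features from the fresh image
# Simulate an object swap in a 2x2 pixel region by flipping its features.
observed[1:3, 2:4] = -rendered[1:3, 2:4]

diff = cosine_difference_map(rendered, observed)
mask = changed_mask(diff)
print("changed pixels:", int(mask.sum()))
```

In a setting like the one the captions describe, the resulting mask would indicate which image regions to treat as changed, so that only the corresponding part of the scene is re-optimized rather than the whole map.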