DynaMem: Online Dynamic Spatio-Semantic Memory for Open World Mobile Manipulation

Peiqi Liu, Zhanqiu Guo, Mohit Warke, Soumith Chintala, Chris Paxton, Nur Muhammad Mahi Shafiullah, Lerrel Pinto

TL;DR

The paper addresses open-vocabulary mobile manipulation (OVMM) in open-world, dynamic environments by introducing DynaMem, a dynamic spatio-semantic memory implemented as a sparse 3D voxel map. Each voxel stores its position $(x, y, z)$, an observation count $C$, a source image ID $I$, a semantic feature vector $f$, and a last-observed time $t$, and the map supports online additions and removals. DynaMem provides two grounding modalities for open-vocabulary queries, vision-language model feature embeddings and multimodal LLM question answering, along with a hybrid approach that blends both and an OWL-v2 cross-check to reduce false positives. It drives navigation with obstacle mapping, frontier-based exploration, and two exploration value maps (one for time-based novelty, one for semantic similarity), all within a closed-loop planner that periodically re-plans. Real-world experiments on Stretch robots and the DynaBench offline benchmark show a 70% success rate on dynamic objects, more than 2x that of static baselines, demonstrating practical viability and establishing a dynamic benchmark for OVMM.
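To make the data structure concrete, here is a minimal sketch of a sparse voxel memory in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the class and method names (`Voxel`, `VoxelMemory`, `add_point`, `best_match`), the default voxel size, and the count-weighted averaging rule are all hypothetical, and the field meanings ($C$ as observation count, $I$ as source image ID, $f$ as feature vector, $t$ as last-observed time) are a reading of the summary above. The query step uses plain cosine similarity and omits the OWL-v2 cross-check.

```python
import math
from dataclasses import dataclass


@dataclass
class Voxel:
    xyz: tuple        # (x, y, z) position
    count: int        # C: number of observations
    image_id: int     # I: ID of the most recent source image
    feature: list     # f: semantic feature vector
    last_seen: float  # t: time of the most recent observation


def _blend(old, new, n):
    """Running average of `old` (built from n samples) with one new sample."""
    return [(o * n + x) / (n + 1) for o, x in zip(old, new)]


def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class VoxelMemory:
    def __init__(self, voxel_size=0.05):  # assumed default, in metres
        self.voxel_size = voxel_size
        self.voxels = {}  # sparse: only occupied cells are stored

    def key(self, point):
        """Integer grid cell containing a 3D point."""
        return tuple(math.floor(c / self.voxel_size) for c in point)

    def add_point(self, point, feature, image_id, t):
        k = self.key(point)
        v = self.voxels.get(k)
        if v is None:
            self.voxels[k] = Voxel(tuple(point), 1, image_id, list(feature), t)
            return
        # Assumed update rule: average position and feature by count,
        # and keep the latest image ID and timestamp.
        v.xyz = tuple(_blend(v.xyz, point, v.count))
        v.feature = _blend(v.feature, feature, v.count)
        v.count += 1
        v.image_id = image_id
        v.last_seen = t

    def remove(self, k):
        self.voxels.pop(k, None)

    def best_match(self, query_feature):
        """Return (cell, similarity) of the best-aligned voxel, or None."""
        if not self.voxels:
            return None
        k = max(self.voxels,
                key=lambda c: _cosine(self.voxels[c].feature, query_feature))
        return k, _cosine(self.voxels[k].feature, query_feature)
```

Storing voxels in a dictionary keyed by grid cell keeps both insertion and removal cheap, which matters when the map has to change online as objects move.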

Abstract

Significant progress has been made in open-vocabulary mobile manipulation, where the goal is for a robot to perform tasks in any environment given a natural language description. However, most current systems assume a static environment, which limits the system's applicability in real-world scenarios where environments frequently change due to human intervention or the robot's own actions. In this work, we present DynaMem, a new approach to open-world mobile manipulation that uses a dynamic spatio-semantic memory to represent a robot's environment. DynaMem constructs a 3D data structure to maintain a dynamic memory of point clouds, and answers open-vocabulary object localization queries using multimodal LLMs or open-vocabulary features generated by state-of-the-art vision-language models. Powered by DynaMem, our robots can explore novel environments, search for objects not found in memory, and continuously update the memory as objects move, appear, or disappear in the scene. We run extensive experiments on the Stretch SE3 robots in three real and nine offline scenes, and achieve an average pick-and-drop success rate of 70% on non-stationary objects, which is more than a 2x improvement over state-of-the-art static systems. Our code as well as our experiment and deployment videos are open sourced and can be found on our project website: https://dynamem.github.io/

Paper Structure

This paper contains 15 sections, 3 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: An illustration of how our online dynamic spatio-semantic memory DynaMem responds to open vocabulary queries in a dynamic environment. During operation and exploration, DynaMem keeps updating its semantic map in memory. DynaMem maintains a voxelized pointcloud representation of the environment, and updates with dynamic changes in the environment by adding and removing points.
  • Figure 2: (Left) DynaMem keeps its memory stored in a sparse voxel grid with associated information at each voxel. (Right) Updating DynaMem by adding new points to it, alongside the rules used to update the stored information.
  • Figure 3: A high-level, 2D depiction of how adding and removing voxels from the voxel map works. New voxels that fall within the RGB-D camera's view frustum are added, and old voxels that should block the view frustum but do not are removed from the map.
  • Figure 4: Querying DynaMem with a natural language query. First, we find the voxel with the highest alignment to the query. Next, we find the latest image of that voxel, and query it with an open-vocabulary object detector to confirm the object location or abstain.
  • Figure 5: The prompting system for querying multimodal LLMs such as GPT-4o or Gemini-1.5 for the image index for an object query.
  • ...and 2 more figures
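
The removal rule described in the Figure 3 caption can be sketched as a depth comparison: a stored voxel that projects inside the current image but lies clearly in front of the measured depth at that pixel should have blocked the view, so it is treated as stale. The sketch below assumes a pinhole camera with points already expressed in the camera frame (+z forward); the function name `stale_voxels`, the intrinsics arguments, and the `margin` tolerance are hypothetical and not taken from the paper.

```python
def stale_voxels(points_cam, depth, fx, fy, cx, cy, margin=0.05):
    """Return indices of stored points that the camera saw through.

    points_cam: list of (x, y, z) in the camera frame, z pointing forward.
    depth:      2D list of measured depths in metres; 0 means no reading.
    margin:     assumed tolerance for depth noise, in metres.
    """
    height, width = len(depth), len(depth[0])
    stale = []
    for i, (x, y, z) in enumerate(points_cam):
        if z <= 0:
            continue  # behind the camera, so outside the frustum
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        if not (0 <= u < width and 0 <= v < height):
            continue  # projects outside the image, so outside the frustum
        observed = depth[v][u]
        # The ray reached a surface farther away than this point,
        # so the point no longer blocks the view.
        if observed > 0 and z < observed - margin:
            stale.append(i)
    return stale
```

Points behind the observed surface are left untouched, since the camera cannot tell whether they are still present.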