mindmap: Spatial Memory in Deep Feature Maps for 3D Action Policies

Remo Steiner, Alexander Millane, David Tingdahl, Clemens Volk, Vikram Ramasamy, Xinjie Yao, Peter Du, Soha Pouya, Shiwei Sheng

TL;DR

mindmap tackles the lack of spatial memory in end-to-end robotic manipulation by pairing a diffusion-based trajectory policy with a metric-semantic 3D reconstruction of the scene. Trajectory diffusion is conditioned on both the current RGB-D observation and a progressively built reconstruction, so the policy can generate actions that depend on objects and geometry outside the current field of view. The authors introduce architectural and data-processing changes, rely on a reconstruction pipeline that is non-differentiable yet runs in real time, and report simulation results on four memory-dependent tasks where state-of-the-art policies without memory struggle. They release the reconstruction system, training code, and evaluation tasks. The work highlights the importance of spatial memory for manipulation in settings where the task space does not fit into a single camera view, and points toward memory-augmented policies for real-world robotics.

Abstract

End-to-end learning of robot control policies, structured as neural networks, has emerged as a promising approach to robotic manipulation. To complete many common tasks, relevant objects are required to pass in and out of a robot's field of view. In these settings, spatial memory - the ability to remember the spatial composition of the scene - is an important competency. However, building such mechanisms into robot learning systems remains an open research problem. We introduce mindmap (Spatial Memory in Deep Feature Maps for 3D Action Policies), a 3D diffusion policy that generates robot trajectories based on a semantic 3D reconstruction of the environment. We show in simulation experiments that our approach is effective at solving tasks where state-of-the-art approaches without memory mechanisms struggle. We release our reconstruction system, training code, and evaluation tasks to spur research in this direction.

Paper Structure

This paper contains 16 sections, 3 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Spatial Memory Task: A humanoid in a simulated industrial space (left) and within a metric-semantic reconstruction built by mindmap (right, colored by PCA). The robot's first-person view is shown inset. The task requires the robot to transfer the hand drill from the shelf to the open box. The drill and box positions must be discovered by the policy, and the two objects cannot both be captured in a single view. Successful task completion therefore requires the policy to remember the spatial layout of the scene. By leveraging the reconstruction, mindmap generates trajectories that depend on parts of the scene outside the robot's current field of view (FOV).
  • Figure 2: Overview of mindmap. mindmap is a denoising diffusion probabilistic model (DDPM) that samples robot trajectories conditioned on sensor observations and a reconstruction of the environment. Images are first passed through a vision foundation model (VFM), and the resulting features are back-projected, using the depth image, to a pointcloud (as in 3D Diffuser Actor [3d_diffuser_actor]). In parallel, a reconstruction of the scene is built that accumulates metric-semantic information from past observations. The two 3D data sources, the instantaneous visual observation and the reconstruction, are passed to a transformer that iteratively denoises robot trajectories.
  • Figure 3: Environments introduced to evaluate policies' spatial memory. From left to right: Cube Stacking: stack three cubes (initial cube positions are randomized); Mug in Drawer: move mug into the drawer containing mugs (positions of objects on the kitchen counter are randomized and the destination drawer position is permuted); Drill in Box: put hand drill into open box (drill position is randomized and open/closed boxes are permuted); Stick in Bin: put candlestick into bin (stick and bin positions are randomized). In all tasks, policies receive a single egocentric camera view whose FOV cannot cover the entire task space.
  • Figure 4: Attention Visualization: Top-down visualization of 3D attention weights (right) and reconstruction (left) for the Mug in Drawer task. The inset shows the current camera view. Extrema appear in regions of interest to the task, such as the mug (yellow arrow) and the drawers in the bottom left/right (white arrows). The high concentration of points in the center is generated by the current view of the camera, while points outside this region are from the reconstruction.
  • Figure 5: Reconstructions of the four environments presented in the results section. For each environment, we show an RGB-colored mesh (top) and the voxel grid containing VFM features colored by PCA (bottom).
  • ...and 1 more figure
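
The back-projection step mentioned in the Figure 2 caption (lifting per-pixel features to a pointcloud using the depth image) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy`, a feature map already resampled to the depth image resolution, and depth measured along the optical axis. The function name `backproject_features` is hypothetical.

```python
import numpy as np


def backproject_features(depth, features, fx, fy, cx, cy):
    """Lift per-pixel features to 3D points in the camera frame.

    Assumptions (not from the paper): pinhole intrinsics, depth in metres
    along the optical axis, zero depth marks invalid pixels.

    depth:    (H, W) depth image
    features: (H, W, C) feature map aligned with the depth image
    returns:  (N, 3) points and (N, C) features for the N valid pixels
    """
    h, w = depth.shape
    # u indexes image columns, v indexes image rows.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth > 0
    z = depth[valid]
    # Invert the pinhole projection u = fx * x / z + cx, v = fy * y / z + cy.
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    points = np.stack([x, y, z], axis=-1)
    return points, features[valid]
```

Each returned point carries the feature vector of the pixel it came from, so the pair can be treated as a featurized pointcloud. Transforming the points into a world frame with the camera pose, which a reconstruction would need in order to accumulate observations over time, is left out of this sketch.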