Seeing the Bigger Picture: 3D Latent Mapping for Mobile Manipulation Policy Learning

Sunghwan Kim, Woojeh Chung, Zhirui Dai, Dwait Bhatt, Arth Shukla, Hao Su, Yulun Tian, Nikolay Atanasov

TL;DR

This paper tackles the limitation of image-based policies in long-horizon mobile manipulation by introducing Seeing the Bigger Picture (SBP), a framework that learns and uses a persistent 3D latent map of the working scene. SBP consists of a modular latent mapping component (a multiresolution grid and a pre-trained scene-agnostic decoder) and a map-conditioned policy that aggregates the map into a global token to guide behavior, enabling global and temporally extended reasoning. The approach supports both behavior cloning and reinforcement learning, and it demonstrates improved performance over image-based baselines on scene-level and sequential manipulation tasks, including zero-shot sim-to-real transfer. By decoupling scene-specific feature optimization from a general decoder and by maintaining long-horizon memory through online map updates, SBP offers a practical path to robust, scalable mobile manipulation in complex, partially observed environments.

Abstract

In this paper, we demonstrate that mobile manipulation policies utilizing a 3D latent map achieve stronger spatial and temporal reasoning than policies relying solely on images. We introduce Seeing the Bigger Picture (SBP), an end-to-end policy learning approach that operates directly on a 3D map of latent features. In SBP, the map extends perception beyond the robot's current field of view and aggregates observations over long horizons. Our mapping approach incrementally fuses multiview observations into a grid of scene-specific latent features. A pre-trained, scene-agnostic decoder reconstructs target embeddings from these features and enables online optimization of the map features during task execution. A policy, trainable with behavior cloning or reinforcement learning, treats the latent map as a state variable and uses global context from the map obtained via a 3D feature aggregator. We evaluate SBP on scene-level mobile manipulation and sequential tabletop manipulation tasks. Our experiments demonstrate that SBP (i) reasons globally over the scene, (ii) leverages the map as long-horizon memory, and (iii) outperforms image-based policies in both in-distribution and novel scenes, e.g., improving the success rate by 25% for the sequential manipulation task.

Paper Structure

This paper contains 21 sections, 11 equations, 8 figures, 2 tables, 1 algorithm.

Figures (8)

  • Figure 1: (a) An RGB rendering of a ReplicaCAD (Szot et al., 2021) scene. (b) A 3D latent feature map constructed by our method, visualized with PCA. (c) Attention weights on the latent map during task execution (e.g., "pick up the bowl"), highlighting regions attended by the policy model.
  • Figure 2: Visualization of per-patch VLM embeddings back-projected into the 3D world frame using depth $Z[p]$ and camera pose $(R,t)$.
  • Figure 3: Latent feature mapping. We represent the scene with a multiresolution feature grid. For any query point $x$, we retrieve features from each level via trilinear interpolation, concatenate them to form $F_\psi(x)$, and decode with $D_\theta$ to reconstruct the target embedding. The model is trained to maximize similarity between predicted and ground-truth embeddings.
  • Figure 4: Global map token. Latent features from $\mathcal{F}$ are decoded to $\mathcal{Y}$ via the decoder $D_{\theta}$ at the finest grid vertices. The 3D feature aggregator processes the coordinate-feature pairs, and its output is max-pooled to produce the global map token $e_m$.
  • Figure 5: Map-conditioned policy network. Proprioceptive state $s_\tau$, image features $E_I(o_\tau)$, task embedding $e_\ell$, and global map token $e_m$ are concatenated to form a joint embedding $h_\tau$, which is mapped to an action $a_\tau$ by the policy network $\pi_{\phi}$.
  • ...and 3 more figures
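
The geometry behind Figures 2 and 3 can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a pinhole camera with hypothetical intrinsics `fx, fy, cx, cy`, assumes that the pose $(R,t)$ maps camera coordinates to world coordinates, and stores each grid level as a nested list of feature vectors over the unit cube. The function names `back_project`, `trilinear`, and `query_multires` are illustrative only, and the decoder $D_\theta$ is left out because its architecture is not described here.

```python
import math


def back_project(u, v, depth, fx, fy, cx, cy, R, t):
    """Lift pixel (u, v) with depth Z[p] into the world frame.

    Assumption: pinhole camera, and (R, t) maps camera to world coordinates.
    """
    # Point in the camera frame.
    xc = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    # Rotate and translate into the world frame: x_w = R x_c + t.
    return tuple(sum(R[i][j] * xc[j] for j in range(3)) + t[i] for i in range(3))


def trilinear(grid, x):
    """Trilinearly interpolate a feature vector at x in [0, 1]^3.

    grid[i][j][k] is the feature vector stored at a vertex of a regular grid
    with n >= 2 vertices per axis (a hypothetical storage layout).
    """
    n = len(grid)
    # Continuous vertex coordinates, clamped to the grid extent.
    pos = [min(max(c, 0.0), 1.0) * (n - 1) for c in x]
    # Lower corner of the enclosing cell; capped so the upper corner exists.
    lo = [min(int(math.floor(p)), n - 2) for p in pos]
    w = [p - l for p, l in zip(pos, lo)]
    dim = len(grid[0][0][0])
    out = [0.0] * dim
    # Blend the 8 corner features with the product of per-axis weights.
    for di in (0, 1):
        for dj in (0, 1):
            for dk in (0, 1):
                weight = ((w[0] if di else 1.0 - w[0])
                          * (w[1] if dj else 1.0 - w[1])
                          * (w[2] if dk else 1.0 - w[2]))
                feat = grid[lo[0] + di][lo[1] + dj][lo[2] + dk]
                for c in range(dim):
                    out[c] += weight * feat[c]
    return out


def query_multires(levels, x):
    """Concatenate the interpolated features of every level, as in F_psi(x)."""
    feats = []
    for grid in levels:
        feats.extend(trilinear(grid, x))
    return feats


def make_grid(n, fn):
    """Helper for the usage example: build an n^3 grid with feature fn(i, j, k)."""
    return [[[fn(i, j, k) for k in range(n)] for j in range(n)] for i in range(n)]


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# A pixel at the principal point, 2 m deep, seen from a camera at (1, 2, 3).
point = back_project(320, 240, 2.0, 500.0, 500.0, 320.0, 240.0,
                     identity, (1.0, 2.0, 3.0))

coarse = make_grid(2, lambda i, j, k: [float(i + 2 * j + 4 * k)])
fine = make_grid(3, lambda i, j, k: [float(i), float(j)])
feature = query_multires([coarse, fine], (0.5, 0.5, 0.5))
```

In this toy setup, `point` is the world-frame location at which a per-patch embedding would be placed, and `feature` has one channel from the coarse level followed by two from the fine level. A decoder would then map such a concatenated vector to the target embedding.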