Seeing the Bigger Picture: 3D Latent Mapping for Mobile Manipulation Policy Learning

Sunghwan Kim; Woojeh Chung; Zhirui Dai; Dwait Bhatt; Arth Shukla; Hao Su; Yulun Tian; Nikolay Atanasov

TL;DR

This paper addresses the limited spatial and temporal context available to image-based policies in long-horizon mobile manipulation by introducing Seeing the Bigger Picture (SBP), a framework that learns and uses a persistent 3D latent map of the working scene. SBP consists of a modular latent mapping component (a multiresolution grid and a pre-trained, scene-agnostic decoder) and a map-conditioned policy that aggregates the map into a global token to guide behavior, enabling global and temporally extended reasoning. The approach supports both behavior cloning and reinforcement learning, and it outperforms image-based baselines on scene-level and sequential manipulation tasks, including zero-shot sim-to-real transfer. By decoupling scene-specific feature optimization from a general decoder and by maintaining long-horizon memory through online map updates, SBP offers a practical path to robust, scalable mobile manipulation in complex, partially observed environments.

Abstract

In this paper, we demonstrate that mobile manipulation policies utilizing a 3D latent map achieve stronger spatial and temporal reasoning than policies relying solely on images. We introduce Seeing the Bigger Picture (SBP), an end-to-end policy learning approach that operates directly on a 3D map of latent features. In SBP, the map extends perception beyond the robot's current field of view and aggregates observations over long horizons. Our mapping approach incrementally fuses multiview observations into a grid of scene-specific latent features. A pre-trained, scene-agnostic decoder reconstructs target embeddings from these features and enables online optimization of the map features during task execution. A policy, trainable with behavior cloning or reinforcement learning, treats the latent map as a state variable and uses global context from the map obtained via a 3D feature aggregator. We evaluate SBP on scene-level mobile manipulation and sequential tabletop manipulation tasks. Our experiments demonstrate that SBP (i) reasons globally over the scene, (ii) leverages the map as long-horizon memory, and (iii) outperforms image-based policies in both in-distribution and novel scenes, e.g., improving the success rate by 25% for the sequential manipulation task.
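The mapping loop described above can be illustrated with a toy sketch. It is not the authors' implementation: it assumes a single-resolution voxel grid with nearest-cell lookup, a fixed random linear map standing in for the pre-trained decoder, plain gradient descent for the online map update, and mean pooling in place of the learned 3D feature aggregator. All names and sizes (`GRID`, `LATENT_DIM`, `EMBED_DIM`, `W_dec`, `update_map`, `global_token`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

GRID = 8          # voxels per axis (hypothetical)
LATENT_DIM = 16   # size of each scene-specific map feature (hypothetical)
EMBED_DIM = 32    # size of the target embedding (hypothetical)

# Frozen, scene-agnostic decoder. A fixed linear map stands in for the
# pre-trained network; it is never updated below.
W_dec = rng.normal(scale=0.1, size=(LATENT_DIM, EMBED_DIM))


def voxel_index(points, lo=-1.0, hi=1.0):
    """Map 3D points in [lo, hi]^3 to integer voxel indices (nearest cell)."""
    idx = np.floor((points - lo) / (hi - lo) * GRID).astype(int)
    return np.clip(idx, 0, GRID - 1)


def reconstruction_error(grid, points, targets):
    """Mean squared error between decoded map features and target embeddings."""
    ix, iy, iz = voxel_index(points).T
    return float(np.mean((grid[ix, iy, iz] @ W_dec - targets) ** 2))


def update_map(grid, points, targets, lr=0.5, steps=100):
    """Optimize only the map features; the decoder stays frozen."""
    ix, iy, iz = voxel_index(points).T
    counts = np.zeros(grid.shape[:3])
    np.add.at(counts, (ix, iy, iz), 1.0)  # observations per voxel
    for _ in range(steps):
        err = grid[ix, iy, iz] @ W_dec - targets
        grad = np.zeros_like(grid)
        # Accumulate per-point gradients, including points sharing a voxel.
        np.add.at(grad, (ix, iy, iz), 2.0 * err @ W_dec.T)
        # Average the gradient within each observed voxel, then step.
        grid -= lr * grad / np.maximum(counts, 1.0)[..., None]
    return grid


def global_token(grid):
    """Mean-pool all voxels into one context vector for the policy."""
    return grid.reshape(-1, LATENT_DIM).mean(axis=0)


# Toy "observations": 3D points with target embeddings the decoder can express.
points = rng.uniform(-1.0, 1.0, size=(32, 3))
targets = rng.normal(size=(32, LATENT_DIM)) @ W_dec

latent_map = np.zeros((GRID, GRID, GRID, LATENT_DIM))
err_before = reconstruction_error(latent_map, points, targets)
latent_map = update_map(latent_map, points, targets)
err_after = reconstruction_error(latent_map, points, targets)
token = global_token(latent_map)

print(f"reconstruction error: {err_before:.4f} -> {err_after:.4f}")
print("global token shape:", token.shape)
```

Because only `latent_map` changes during `update_map`, the same decoder could in principle be reused across scenes, which mirrors the split between scene-specific features and a scene-agnostic decoder described in the abstract. A policy would then consume `token` alongside its other inputs.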
