
ReMAP-DP: Reprojected Multi-view Aligned PointMaps for Diffusion Policy

Xinzhang Yang, Renjun Wu, Jinyan Liu, Xuesong Li

Abstract

Generalist robot policies built upon 2D visual representations excel at semantic reasoning but inherently lack the explicit 3D spatial awareness required for high-precision tasks. Existing 3D integration methods struggle to bridge this gap due to the structural irregularity of sparse point clouds and the geometric distortion introduced by multi-view orthographic rendering. To overcome these barriers, we present ReMAP-DP, a novel framework that combines standardized perspective reprojection with a structure-aware dual-stream diffusion policy. By coupling the reprojected views with pixel-aligned PointMaps, our dual-stream architecture uses learnable modality embeddings to fuse frozen semantic features with explicit geometric descriptors, ensuring precise implicit patch-level alignment. Extensive experiments across simulation and real-world environments demonstrate ReMAP-DP's superior performance in diverse manipulation tasks. On RoboTwin 2.0, it attains a 59.3% average success rate, outperforming the DP3 baseline by 6.6%. On ManiSkill 3, our method yields a 28% improvement over DP3 on the geometrically challenging Stack Cube task. Furthermore, ReMAP-DP exhibits remarkable real-world robustness, executing high-precision and dynamic manipulations with superior data efficiency from only a handful of demonstrations. The project page is available at: https://icr-lab.github.io/ReMAP-DP/
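
To make the idea of a pixel-aligned PointMap concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a pinhole camera model, 4x4 homogeneous extrinsics, and a simple nearest-point z-buffer; the function names `backproject` and `reproject` and all parameters are hypothetical. It lifts a depth map to world-frame points, then renders those points into a virtual perspective camera, producing an RGB image and a PointMap whose pixel (v, u) stores the world XYZ of the surface seen at that pixel.

```python
import numpy as np

def backproject(depth, K, cam_to_world):
    """Lift a depth map (H, W) to world-frame points of shape (H*W, 3)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.reshape(-1)
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)
    return (pts_cam @ cam_to_world.T)[:, :3]

def reproject(points, colors, K, world_to_cam, h, w):
    """Render points into a virtual pinhole camera (assumed model).

    Returns an RGB image (h, w, 3) and a PointMap (h, w, 3) holding the
    world XYZ of the nearest point that lands on each pixel.
    """
    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    pc = (pts_h @ world_to_cam.T)[:, :3]
    front = pc[:, 2] > 1e-6                      # keep points in front of the camera
    pc, points, colors = pc[front], points[front], colors[front]
    u = np.round(K[0, 0] * pc[:, 0] / pc[:, 2] + K[0, 2]).astype(int)
    v = np.round(K[1, 1] * pc[:, 1] / pc[:, 2] + K[1, 2]).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, z = u[inside], v[inside], pc[inside, 2]
    points, colors = points[inside], colors[inside]
    # z-buffer: sort near-to-far, then keep the first hit per pixel
    order = np.argsort(z, kind="stable")
    pix = (v * w + u)[order]
    _, first = np.unique(pix, return_index=True)
    keep = order[first]
    rgb = np.zeros((h, w, 3), dtype=np.float64)
    pointmap = np.zeros((h, w, 3), dtype=np.float64)
    rgb[v[keep], u[keep]] = colors[keep]
    pointmap[v[keep], u[keep]] = points[keep]
    return rgb, pointmap

# Toy usage: a flat surface 2 m away, re-rendered from the same camera.
K = np.array([[5.0, 0.0, 3.0], [0.0, 5.0, 2.0], [0.0, 0.0, 1.0]])
depth = np.full((4, 6), 2.0)
colors = np.arange(4 * 6 * 3, dtype=np.float64).reshape(-1, 3)
pts = backproject(depth, K, np.eye(4))
rgb, pointmap = reproject(pts, colors, K, np.eye(4), 4, 6)
```

Because the RGB image and the PointMap share one pixel grid, a patch cut from the same location in both maps describes the same piece of the scene, which is the property a patch-level fusion step can rely on.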

Paper Structure

This paper contains 27 sections, 5 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: (a) ReMAP-DP is a multi-view projection framework for visuomotor control. Unlike approaches relying on sparse point cloud downsampling, our work leverages dense PointMaps and RGB inputs, synergizing explicit geometric structures with visual semantics through cross-modal transformer fusion. (b) The experimental results indicate that ReMAP-DP outperforms multiple strong baselines and (c) demonstrates robust performance across a diverse range of real-world tasks.
  • Figure 2: Overall architecture of ReMAP-DP. The method consists of three components: (1) Back-projection and reprojection produce RGB images and PointMaps for workspace alignment and denoising. (2) A dual-stream encoder processes the RGB and geometric features, and a transformer fuses them. (3) An action generation module employs a 1D-UNet Diffusion Policy to predict precise robot actions conditioned on the fused multi-modal features.
  • Figure 3: Top: Visualization of ten selected tasks from the RoboTwin 2.0 benchmark. Bottom: Visualization of five selected tasks from the ManiSkill 3 benchmark.
  • Figure 4: Left: Architecture of the Geometry Encoder. Middle: Efficacy of Multi-View Projection. Right: Impact of Modality Embeddings.
  • Figure 5: Real-World Experiments. Top: Real-world experimental setup, including the camera positions and robot arm configuration. Bottom: Objects used in the tasks and their sizes.
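
The fusion step described in the Figure 2 caption (component 2) can be illustrated with a small NumPy sketch. This is a hedged illustration under stated assumptions, not the paper's architecture: it assumes both streams are already projected to a common token width, that both streams share one positional embedding per patch (so token i in each stream refers to the same image patch), and that fusion is a single-head self-attention layer with a residual connection. The names `patchify`, `fuse`, `pos_emb`, and `mod_emb` are hypothetical, and random matrices stand in for the frozen semantic encoder, the geometry encoder, and all learned weights.

```python
import numpy as np

def patchify(x, p):
    """Split an (H, W, C) map into (H//p * W//p, p*p*C) flattened patches."""
    h, w, c = x.shape
    x = x[: h - h % p, : w - w % p]              # crop to a multiple of p
    gh, gw = x.shape[0] // p, x.shape[1] // p
    x = x.reshape(gh, p, gw, p, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, p * p * c)

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)      # subtract max for stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def fuse(rgb_tokens, geo_tokens, pos_emb, mod_emb, Wq, Wk, Wv):
    """Fuse two (N, D) token streams with one self-attention layer.

    Assumption: the shared pos_emb marks which patch a token comes from,
    while mod_emb[0] / mod_emb[1] mark which modality it comes from.
    """
    rgb = rgb_tokens + pos_emb + mod_emb[0]
    geo = geo_tokens + pos_emb + mod_emb[1]
    tokens = np.concatenate([rgb, geo], axis=0)  # (2N, D)
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return tokens + attn @ v                     # residual connection

# Toy usage with random stand-ins for every learned component.
rng = np.random.default_rng(0)
D, p = 16, 4
image = rng.normal(size=(8, 8, 3))               # reprojected RGB view
pointmap = rng.normal(size=(8, 8, 3))            # pixel-aligned XYZ map
rgb_tokens = patchify(image, p) @ rng.normal(size=(p * p * 3, D))
geo_tokens = patchify(pointmap, p) @ rng.normal(size=(p * p * 3, D))
pos_emb = rng.normal(size=(rgb_tokens.shape[0], D))
mod_emb = rng.normal(size=(2, D))
Wq, Wk, Wv = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))
fused = fuse(rgb_tokens, geo_tokens, pos_emb, mod_emb, Wq, Wk, Wv)
```

In this sketch the fused tokens would then serve as the conditioning input to the action generation module; how the conditioning is pooled or injected is not specified here and is left out.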