
Action-Geometry Prediction with 3D Geometric Prior for Bimanual Manipulation

Chongyang Xu, Haipeng Li, Shen Cheng, Jingyu Hu, Haoqiang Fan, Ziliang Feng, Shuaicheng Liu

TL;DR

The policy fuses geometry-aware latents, 2D semantic features, and proprioception into a unified state representation, and uses a diffusion model to jointly predict a future action chunk and a future 3D latent that decodes into a dense pointmap.

Abstract

Bimanual manipulation requires policies that can reason about 3D geometry, anticipate how it evolves under action, and generate smooth, coordinated motions. However, existing methods typically rely on 2D features with limited spatial awareness, or require explicit point clouds that are difficult to obtain reliably in real-world settings. At the same time, recent 3D geometric foundation models show that accurate and diverse 3D structure can be reconstructed directly from RGB images in a fast and robust manner. We leverage this opportunity and propose a framework that builds bimanual manipulation directly on a pre-trained 3D geometric foundation model. Our policy fuses geometry-aware latents, 2D semantic features, and proprioception into a unified state representation, and uses a diffusion model to jointly predict a future action chunk and a future 3D latent that decodes into a dense pointmap. By explicitly predicting how the 3D scene will evolve together with the action sequence, the policy gains strong spatial understanding and predictive capability using only RGB observations. We evaluate our method both in simulation on the RoboTwin benchmark and in real-world robot executions. Our approach consistently outperforms 2D-based and point-cloud-based baselines, achieving state-of-the-art performance in manipulation success, inter-arm coordination, and 3D spatial prediction accuracy. Code is available at https://github.com/Chongyang-99/GAP.git.

Paper Structure (17 sections, 7 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Paradigm Comparison. 2D-based methods learn implicit 3D representations from multi-view RGB observations, relying purely on 2D cues. 3D-based methods require camera calibration and preset workspaces to crop point clouds, which limits generalization and scalability. In contrast, our approach leverages powerful 2D and 3D pretrained priors to fuse semantic and geometric perception, enabling robust joint prediction of actions and geometry without strict calibration or workspace constraints.
  • Figure 2: Overview of our method. Given a sequence of past RGB frames, the current image, and proprioceptive state, our model extracts 3D geometric features, 2D semantic features, and robot state embeddings through three parallel encoders. These signals are fused by a Transformer into a unified semantic and geometric context that conditions a joint denoising process. A conditional diffusion decoder then predicts both a future action chunk and a future 3D latent, which is further decoded into a dense pointmap.
  • Figure 3: Bimanual tasks in the RoboTwin 2.0 (Mu et al., 2025) benchmark.
  • Figure 4: Data Efficiency. Leveraging pre-trained features, our method achieves high data efficiency, outperforming 2D methods in low-data regimes and surpassing the performance of DP3 as more data becomes available.
  • Figure 5: Real-World Setting. Our real-world platform features the AgileX Cobot Magic bimanual system, equipped with three RealSense D435i cameras, on which we evaluate four challenging tasks.
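
The joint prediction described in the Figure 2 caption (one denoising process over both a future action chunk and a future 3D latent) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The tensor shapes, the cosine noise schedule, the flatten-and-concatenate layout, and all function names below are assumptions chosen for illustration, and the learned denoiser is replaced by the true noise so that the round trip can be checked.

```python
import numpy as np

def cosine_alpha_bar(num_steps: int, s: float = 0.008) -> np.ndarray:
    """Cumulative signal level alpha_bar_t of a cosine schedule (assumed, not from the paper)."""
    t = np.arange(num_steps + 1) / num_steps
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return (f / f[0])[1:]  # drop step 0, shape (num_steps,)

def make_joint_target(action_chunk: np.ndarray, future_latent: np.ndarray) -> np.ndarray:
    """Flatten and concatenate both prediction targets into one vector."""
    return np.concatenate([action_chunk.ravel(), future_latent.ravel()])

def split_joint(x: np.ndarray, action_shape: tuple, latent_shape: tuple):
    """Inverse of make_joint_target: recover the two targets from the joint vector."""
    n_act = int(np.prod(action_shape))
    return x[:n_act].reshape(action_shape), x[n_act:].reshape(latent_shape)

def add_noise(x0: np.ndarray, t: int, alpha_bar: np.ndarray, rng: np.random.Generator):
    """DDPM forward process: x_t = sqrt(ab_t) * x0 + sqrt(1 - ab_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

# Hypothetical sizes: 8-step action chunk, 14 action dims (two arms),
# and a future 3D latent of 16 tokens with 32 channels.
H, A, N, D = 8, 14, 16, 32
rng = np.random.default_rng(0)
action_chunk = rng.standard_normal((H, A))
future_latent = rng.standard_normal((N, D))

alpha_bar = cosine_alpha_bar(100)
x0 = make_joint_target(action_chunk, future_latent)
t = 50
x_t, eps = add_noise(x0, t, alpha_bar, rng)

# A trained denoiser would predict eps from (x_t, t, fused context).
# Here the true eps stands in for that prediction to show the inversion.
x0_rec = (x_t - np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])
act_rec, lat_rec = split_joint(x0_rec, (H, A), (N, D))
```

Because both targets share one noised vector, a single denoiser conditioned on the fused context has to explain the action chunk and the future geometry together, which is the coupling the caption describes. In a full system the recovered latent would then be passed to a pointmap decoder, which this sketch omits.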