Gripper Keypose and Object Pointflow as Interfaces for Bimanual Robotic Manipulation

Yuyin Yang, Zetao Cai, Yang Tian, Jia Zeng, Jiangmiao Pang

TL;DR

This work tackles the challenge of versatile, precise bimanual robotic manipulation by proposing PPI, an end-to-end interface-based policy that integrates target gripper keyposes and object pointflow into a diffusion Transformer-based predictor. By fusing rich perception via 3D semantic fields with two actionable interfaces, PPI delivers improved spatial localization and flexible trajectory generation, achieving state-of-the-art results on RLBench2 and strong real-world performance. The approach demonstrates notable generalization to unseen objects and robustness to visual disturbances, while ablations confirm the complementary value of the two interfaces. Despite computational costs from diffusion and foundation-model components, the results suggest significant practical potential for complex, long-horizon bimanual tasks in real-world settings.

Abstract

Bimanual manipulation is a challenging yet crucial robotic capability that demands precise spatial localization and versatile motion trajectories. Existing approaches fall into two categories: keyframe-based strategies, which predict gripper poses at keyframes and execute them via motion planners, and continuous control methods, which estimate actions sequentially at each timestep. Keyframe-based methods lack inter-frame supervision and struggle to perform consistently or execute curved motions, while continuous methods suffer from weaker spatial perception. To address these issues, this paper introduces PPI (keyPose and Pointflow Interface), an end-to-end framework that integrates the prediction of target gripper poses and object pointflow with continuous action estimation. These interfaces enable the model to attend effectively to the target manipulation area, while the overall framework guides diverse and collision-free trajectories. By combining interface predictions with continuous action estimation, PPI demonstrates superior performance in diverse bimanual manipulation tasks, providing enhanced spatial localization and sufficient flexibility in handling movement restrictions. In extensive evaluations, PPI significantly outperforms prior methods in both simulated and real-world experiments, achieving state-of-the-art performance with a +16.1% improvement on the RLBench2 simulation benchmark and an average gain of +27.5% across four challenging real-world tasks. Notably, PPI exhibits strong stability, high precision, and remarkable generalization capabilities in real-world scenarios. Project page: https://yuyinyang3y.github.io/PPI/

Paper Structure

This paper contains 28 sections, 5 equations, 15 figures, 21 tables.

Figures (15)

  • Figure 1: In contrast to (i) keyframe-based policies, which excel in spatial localization but struggle with movement restrictions (e.g., curved motion and collision-free actions), and (ii) continuous-action-based policies, which accommodate diverse trajectories but lack strong perception, we introduce a continuous action policy that incorporates two interfaces: target gripper poses and object pointflow, balancing task diversity with spatial awareness. Our model, PPI, surpasses previous states of the art and consistently outperforms its ablated variants.
  • Figure 2: Overview of PPI. (a) Perception. We first construct a 3D semantic neural field $S_t$ and sample initial query points $F_0$ for pointflow prediction. (b) Interface. Next, we define two intermediate interfaces: target gripper poses $a_t^k$ and object pointflow $F$. (c) Prediction. Finally, a diffusion transformer takes as input robot proprioception tokens $c_t$, scene tokens $S_t$, language tokens $l$, pointflow query tokens $F_0$, and the action tokens $a_t^k$ and $a_t^{c}$ corrupted with Gaussian noise. Using a carefully designed unidirectional attention, the model progressively denoises action predictions conditioned on the interfaces.
  • Figure 3: Task process visualizations in four real-world tasks.
  • Figure 4: Two real-world setups.
  • Figure 5: Visualization under object interference in Carry the Tray.
  • ...and 10 more figures
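
The Figure 2 caption mentions a unidirectional attention over several token groups but does not spell out the pattern. The sketch below shows one plausible way such a block-wise mask could be built with NumPy. It is an illustration under stated assumptions, not the authors' implementation: the group names, the group sizes, and the ordering in which groups may attend to one another (conditioning tokens, then pointflow, then keypose, then continuous actions) are all hypothetical.

```python
import numpy as np

# Hypothetical token groups, loosely following the Figure 2 caption:
# "cond" = proprioception, scene, and language tokens,
# "pointflow" = pointflow query tokens,
# "keypose" = target gripper pose tokens,
# "action" = continuous action tokens.
GROUPS = ["cond", "pointflow", "keypose", "action"]

# ALLOWED[q] lists the groups that queries in group q may attend to.
# This one-way ordering is an assumption made for illustration only.
ALLOWED = {
    "cond": {"cond"},
    "pointflow": {"cond", "pointflow"},
    "keypose": {"cond", "pointflow", "keypose"},
    "action": {"cond", "pointflow", "keypose", "action"},
}


def build_mask(sizes):
    """Return a boolean (N, N) mask; mask[i, j] is True when token i may attend to token j."""
    labels = [g for g in GROUPS for _ in range(sizes[g])]
    n = len(labels)
    mask = np.zeros((n, n), dtype=bool)
    for i, query_group in enumerate(labels):
        for j, key_group in enumerate(labels):
            mask[i, j] = key_group in ALLOWED[query_group]
    return mask


def masked_attention(q, k, v, mask):
    """Single-head scaled dot-product attention restricted by a boolean mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Disallowed pairs get -inf so they receive zero weight after the softmax.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


# Example with made-up group sizes.
sizes = {"cond": 2, "pointflow": 1, "keypose": 1, "action": 2}
mask = build_mask(sizes)
rng = np.random.default_rng(0)
x = rng.normal(size=(mask.shape[0], 8))
out = masked_attention(x, x, x, mask)
```

With a mask of this shape, information flows one way: the continuous action tokens can read the interface tokens, while the conditioning tokens are unaffected by whatever noise is present in the action tokens.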