Gripper Keypose and Object Pointflow as Interfaces for Bimanual Robotic Manipulation
Yuyin Yang, Zetao Cai, Yang Tian, Jia Zeng, Jiangmiao Pang
TL;DR
This work tackles the challenge of versatile, precise bimanual robotic manipulation by proposing PPI, an end-to-end interface-based policy that integrates target gripper keyposes and object pointflow into a diffusion Transformer-based predictor. By fusing rich perception via 3D semantic fields with two actionable interfaces, PPI delivers improved spatial localization and flexible trajectory generation, achieving state-of-the-art results on RLBench2 and strong real-world performance. The approach demonstrates notable generalization to unseen objects and robustness to visual disturbances, while ablations confirm the complementary value of the two interfaces. Despite computational costs from diffusion and foundation-model components, the results suggest significant practical potential for complex, long-horizon bimanual tasks in real-world settings.
Abstract
Bimanual manipulation is a challenging yet crucial robotic capability that demands precise spatial localization and versatile motion trajectories. Existing approaches fall into two categories: keyframe-based strategies, which predict gripper poses at keyframes and execute them via motion planners, and continuous control methods, which estimate actions sequentially at each timestep. Keyframe-based methods lack inter-frame supervision and therefore struggle to perform consistently or to execute curved motions, while continuous methods suffer from weaker spatial perception. To address these issues, this paper introduces PPI (keyPose and Pointflow Interface), an end-to-end framework that integrates the prediction of target gripper poses and object pointflow with continuous action estimation. These interfaces enable the model to attend effectively to the target manipulation area, while the overall framework guides diverse and collision-free trajectories. By combining interface predictions with continuous action estimation, PPI demonstrates superior performance in diverse bimanual manipulation tasks, providing enhanced spatial localization and the flexibility to handle movement restrictions. In extensive evaluations, PPI significantly outperforms prior methods in both simulated and real-world experiments, achieving state-of-the-art performance with a +16.1% improvement on the RLBench2 simulation benchmark and an average gain of +27.5% across four challenging real-world tasks. Notably, PPI exhibits strong stability, high precision, and remarkable generalization capabilities in real-world scenarios. Project page: https://yuyinyang3y.github.io/PPI/
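
The contrast between the two existing families can be illustrated with a toy 2D sketch. It is not the PPI method or any published implementation: straight-line interpolation stands in for a motion planner, and `keyframe_rollout`, `continuous_rollout`, and `arc_step` are hypothetical names chosen for illustration. Under these assumptions, a two-keyframe rollout stays on the chord between start and goal, whereas per-timestep actions can follow a curved path.

```python
import math

def keyframe_rollout(keyposes, steps_per_segment):
    """Straight-line interpolation between consecutive keyposes.

    Assumption: a toy stand-in for a motion planner, not a real one.
    """
    path = []
    for (x0, y0), (x1, y1) in zip(keyposes, keyposes[1:]):
        for i in range(steps_per_segment):
            t = i / steps_per_segment
            path.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    path.append(keyposes[-1])
    return path

def continuous_rollout(start, step_fn, num_steps):
    """Apply one action (here a position delta) per timestep."""
    path = [start]
    for k in range(num_steps):
        dx, dy = step_fn(k, path[-1])
        x, y = path[-1]
        path.append((x + dx, y + dy))
    return path

def arc_step(num_steps):
    """Hypothetical per-step 'policy' whose deltas trace a unit semicircle
    from (1, 0) to (-1, 0)."""
    def step(k, _pos):
        a0 = math.pi * k / num_steps
        a1 = math.pi * (k + 1) / num_steps
        return (math.cos(a1) - math.cos(a0), math.sin(a1) - math.sin(a0))
    return step

N = 20
# Two keyframes only: start pose and goal pose.
keyframe_path = keyframe_rollout([(1.0, 0.0), (-1.0, 0.0)], N)
# Same start and goal, but with an action at every timestep.
continuous_path = continuous_rollout((1.0, 0.0), arc_step(N), N)

# Peak height reached by each path: the keyframe path never leaves y = 0,
# while the continuous path rises to the top of the arc.
max_height_keyframe = max(y for _, y in keyframe_path)
max_height_continuous = max(y for _, y in continuous_path)
```

Both rollouts share the same start and goal, so the difference in peak height comes only from the intermediate supervision: the keyframe rollout has none between its two poses, which is the limitation on curved motions described above.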
