PerAct2: Benchmarking and Learning for Robotic Bimanual Manipulation Tasks
Markus Grotz, Mohit Shridhar, Tamim Asfour, Dieter Fox
TL;DR
This work tackles the scarcity of standardized benchmarks for bimanual manipulation by extending RLBench into a 13-task, 23-variation bimanual suite and introducing PerAct2, a language-conditioned, single-network agent that predicts coordinated 6-DoF actions for two arms via a shared voxel representation and a Perceiver IO backbone. The agent learns from language goals and expert demonstrations, using keyframe-based training and a loss that jointly optimizes both arms. In simulation and real-world tests, PerAct2 and PerAct-LF outperform image-based baselines and transfer to humanoid platforms, albeit with limited overall success rates and clear failure modes. The benchmark and open-source release provide a foundation for reproducible evaluation and further advances in bimanual coordination for robots. Overall, the paper advances how coordinated two-arm manipulation is studied and learned in realistic, diverse tasks.
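The idea of a loss that jointly optimizes both arms can be illustrated with a minimal sketch. It assumes, as is common for behavioral cloning with discretized actions, that each arm's prediction is scored with cross-entropy per output head and that the two per-arm losses are simply summed. The head names (`translation`, `gripper`), the bin counts, and the function names are hypothetical and are not taken from the released code.

```python
import numpy as np


def cross_entropy(logits, target):
    """Cross-entropy of one categorical prediction against an integer label."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max()  # subtract the max for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-log_probs[target])


def arm_loss(logits_by_head, targets_by_head):
    """Sum the cross-entropy over all output heads of a single arm."""
    return sum(
        cross_entropy(logits_by_head[head], targets_by_head[head])
        for head in targets_by_head
    )


def bimanual_loss(right_logits, right_targets, left_logits, left_targets):
    """Joint loss (assumed form): the sum of the two per-arm losses.

    Because both terms come from one network, a single backward pass
    through this sum would update shared parameters for both arms.
    """
    return arm_loss(right_logits, right_targets) + arm_loss(left_logits, left_targets)


# Hypothetical heads: 8 translation bins and 2 gripper states per arm.
uniform_logits = {"translation": np.zeros(8), "gripper": np.zeros(2)}
targets = {"translation": 3, "gripper": 1}
loss = bimanual_loss(uniform_logits, targets, uniform_logits, targets)
```

With uniform logits, each head contributes the log of its bin count, so the sketch makes it easy to check that both arms contribute equally to the total.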
Abstract
Bimanual manipulation is challenging because of the precise spatial and temporal coordination required between two arms. While several real-world bimanual systems exist, there is a lack of simulated benchmarks with enough task diversity to systematically study bimanual capabilities across a wide range of tabletop tasks. This paper addresses the gap by extending RLBench to bimanual manipulation. We open-source our code and benchmark, which comprises 13 new tasks with 23 unique task variations, each requiring a high degree of coordination and adaptability. To kickstart the benchmark, we extend several state-of-the-art methods to bimanual manipulation and also present a language-conditioned behavioral cloning agent, PerAct2, which enables the learning and execution of bimanual 6-DoF manipulation tasks. Our novel network architecture efficiently integrates language processing with action prediction, allowing robots to understand and perform complex bimanual tasks in response to user-specified goals. The project website with code is available at: http://bimanual.github.io
