PerAct2: Benchmarking and Learning for Robotic Bimanual Manipulation Tasks

Markus Grotz, Mohit Shridhar, Tamim Asfour, Dieter Fox

TL;DR

This work tackles the scarcity of standardized benchmarks for bimanual manipulation by extending RLBench into a 13-task, 23-variation bimanual suite and introducing PerAct^2, a language-conditioned, single-network agent that predicts coordinated 6-DoF actions for two arms via a shared voxel representation and Perceiver IO backbone. The approach enables learning from language goals and expert demonstrations, with keyframe-based training and a loss that jointly optimizes both arms. In simulation and real-world tests, PerAct^2 and PerAct-LF outperform image-based baselines and demonstrate transferability to humanoid platforms, albeit with limited overall success rates and clear failure modes. The benchmark and open-source release provide a foundation for reproducible evaluation and further advances in bimanual coordination for robots. Overall, the paper advances how we study and learn coordinated two-arm manipulation in realistic, diverse tasks.

Abstract

Bimanual manipulation is challenging due to the precise spatial and temporal coordination required between two arms. While there exist several real-world bimanual systems, there is a lack of simulated benchmarks with a large task diversity for systematically studying bimanual capabilities across a wide range of tabletop tasks. This paper addresses the gap by extending RLBench to bimanual manipulation. We open-source our code and benchmark comprising 13 new tasks with 23 unique task variations, each requiring a high degree of coordination and adaptability. To kickstart the benchmark, we extended several state-of-the-art methods to bimanual manipulation and also present a language-conditioned behavioral cloning agent -- PerAct2, which enables the learning and execution of bimanual 6-DoF manipulation tasks. Our novel network architecture efficiently integrates language processing with action prediction, allowing robots to understand and perform complex bimanual tasks in response to user-specified goals. Project website with code is available at: http://bimanual.github.io

Paper Structure (19 sections, 2 equations, 17 figures, 5 tables)

Figures (17)

  • Figure 1: Selected bimanual tasks from the benchmark as well as real-world examples. Due to the architecture design, the method can easily be transferred to other robots, as the policy outputs a 6-D pose and is agnostic to the underlying controller.
  • Figure 2: The system architecture. PerAct$^2$ takes proprioception, RGB-D camera images, and a task description as input. The voxel grid is constructed by merging data from multiple RGB-D cameras. A PerceiverIO transformer is used to learn features at both the voxel and language levels. The output for each robot arm is a discretized action comprising a six-dimensional end-effector pose, the state of the gripper, and an extra indicator of whether the motion planner should use collision avoidance.
  • Figure 3: Overview of the different tasks. For example, the task visualized in (g) includes a handover of a specific item.
  • Figure 4: Selected snapshots of the real-world experiments showing different tasks. A video showing the full experiments is available on the project's website.
  • ...and 12 more figures
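
The Figure 2 caption describes two steps that a small sketch may help make concrete: merging points from several RGB-D cameras into one voxel grid, and expressing each arm's output as a discretized action. The code below is only an illustration under stated assumptions, not the authors' implementation. The function names (`merge_clouds`, `voxelize`, `discretize_action`), the Euler-angle rotation encoding, the 5-degree bin size, and the workspace bounds are all hypothetical choices made for this example.

```python
import numpy as np


def merge_clouds(clouds):
    """Stack per-camera point clouds (already in a common world frame) into one (N, 3) array."""
    return np.concatenate(
        [np.asarray(c, dtype=float).reshape(-1, 3) for c in clouds], axis=0
    )


def _to_voxel_index(xyz, bounds_min, bounds_max, grid_size):
    """Map metric coordinates to integer voxel indices inside the workspace."""
    scale = grid_size / (bounds_max - bounds_min)
    idx = np.floor((xyz - bounds_min) * scale).astype(int)
    return np.clip(idx, 0, grid_size - 1)


def voxelize(points, bounds_min, bounds_max, grid_size):
    """Return a boolean occupancy grid; points outside the bounds are dropped."""
    points = np.asarray(points, dtype=float)
    bounds_min = np.asarray(bounds_min, dtype=float)
    bounds_max = np.asarray(bounds_max, dtype=float)
    inside = np.all((points >= bounds_min) & (points < bounds_max), axis=1)
    idx = _to_voxel_index(points[inside], bounds_min, bounds_max, grid_size)
    grid = np.zeros((grid_size,) * 3, dtype=bool)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid


def discretize_action(position, euler_deg, gripper_open, use_collision_avoidance,
                      bounds_min, bounds_max, grid_size, rot_resolution_deg=5.0):
    """Discretize one arm's action: voxel index, rotation bins, and two binary flags.

    The Euler-angle encoding and the 5-degree resolution are assumptions of this sketch.
    """
    bounds_min = np.asarray(bounds_min, dtype=float)
    bounds_max = np.asarray(bounds_max, dtype=float)
    voxel = _to_voxel_index(np.asarray(position, dtype=float),
                            bounds_min, bounds_max, grid_size)
    # Wrap angles into [0, 360) before binning so negative angles get valid bins.
    rot_bins = np.floor((np.asarray(euler_deg, dtype=float) % 360.0)
                        / rot_resolution_deg).astype(int)
    return {
        "voxel": tuple(int(v) for v in voxel),
        "rot_bins": tuple(int(b) for b in rot_bins),
        "gripper_open": int(bool(gripper_open)),
        "collision": int(bool(use_collision_avoidance)),
    }


# Example: two cameras, a 10^3 grid over a unit-cube workspace (hypothetical values).
cam_a = [[0.15, 0.15, 0.15]]
cam_b = [[0.95, 0.95, 0.95], [2.0, 2.0, 2.0]]  # the last point lies outside the bounds
grid = voxelize(merge_clouds([cam_a, cam_b]), [0, 0, 0], [1, 1, 1], grid_size=10)

# One discretized action per arm, mirroring the per-arm output in the caption.
right = discretize_action([0.55, 0.25, 0.75], [0, 90, -90], True, False,
                          [0, 0, 0], [1, 1, 1], grid_size=10)
left = discretize_action([0.15, 0.85, 0.45], [180, 0, 45], False, True,
                         [0, 0, 0], [1, 1, 1], grid_size=10)
```

In this sketch both arms share the same workspace bounds and grid, which loosely mirrors the shared voxel representation mentioned in the TL;DR. A real pipeline would also carry color features per voxel and would need the camera extrinsics to bring each cloud into the common frame.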