Rethinking Bimanual Robotic Manipulation: Learning with Decoupled Interaction Framework

Jian-Jian Jiang, Xiao-Ming Wu, Yi-Xiang He, Ling-An Zeng, Yi-Lin Wei, Dandan Zhang, Wei-Shi Zheng

TL;DR

The paper tackles the challenge of learning bimanual manipulation by recognizing that tasks can be either uncoordinated or coordinated and that integrated control struggles with high-dimensional joint actions and phase-dependent cooperation. It introduces a Decoupled Interaction Framework that assigns independent policies per arm to simplify learning of uncoordinated tasks, coupled with a selective interaction module that adaptively modulates cross-arm information to support coordination. Empirical results on the RoboTwin benchmark show substantial improvements over state-of-the-art methods (e.g., a 23.5% average gain) and strong scalability to multi-agent scenarios, along with robust real-world performance. The work demonstrates the value of task-aware decoupling and selective interaction for efficient, flexible, and scalable bimanual and multi-agent manipulation, with code to be released to the community.

Abstract

Bimanual robotic manipulation is an emerging and critical topic in the robotics community. Previous works primarily rely on integrated control models that take the perceptions and states of both arms as inputs to directly predict their actions. However, we argue that bimanual manipulation involves not only coordinated tasks but also various uncoordinated tasks that do not require explicit cooperation during execution, such as grasping objects with the closest hand. Integrated control frameworks overlook such tasks because they enforce cooperation from the input stage onward. In this paper, we propose a novel decoupled interaction framework that considers the characteristics of different tasks in bimanual manipulation. The key insight of our framework is to assign an independent model to each arm to enhance the learning of uncoordinated tasks, while introducing a selective interaction module that adaptively learns weights from its own arm to improve the learning of coordinated tasks. Extensive experiments on seven tasks in the RoboTwin dataset demonstrate that: (1) Our framework achieves outstanding performance, with a 23.5% boost over the SOTA method. (2) Our framework is flexible and can be seamlessly integrated into existing methods. (3) Our framework can be effectively extended to multi-agent manipulation tasks, achieving a 28% boost over the integrated control SOTA. (4) The performance boost stems from the decoupled design itself, surpassing the SOTA by 16.5% in success rate with only 1/6 of the model size.
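
The contrast between integrated control and per-arm models can be illustrated with a minimal sketch. It assumes toy linear maps in place of learned networks; the weight values, dimensions, and function names are hypothetical and are not taken from the paper. It shows only the structural point: in an integrated model the left arm's action can change whenever the right arm's observation changes, whereas a per-arm model is unaffected by the other arm's input.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(weights: Matrix, x: Vector) -> Vector:
    """Toy linear map y = W x, standing in for a learned policy network."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def integrated_policy(weights: Matrix, obs_left: Vector, obs_right: Vector) -> Vector:
    """A single model sees both arms' inputs and outputs both arms' actions."""
    return linear(weights, obs_left + obs_right)


def decoupled_policy(weights: Matrix, obs_own: Vector) -> Vector:
    """One model per arm; it sees only its own arm's inputs."""
    return linear(weights, obs_own)


# Hypothetical toy weights: 2-D observation and 1-D action per arm.
W_INTEGRATED = [
    [1.0, 0.0, 0.5, 0.0],  # left action also depends on the right arm's input
    [0.0, 0.5, 0.0, 1.0],  # right action also depends on the left arm's input
]
W_LEFT = [[1.0, 0.0]]

obs_left = [0.2, 0.4]
obs_right_a = [1.0, 0.0]
obs_right_b = [3.0, 0.0]  # only the right arm's observation changes

left_integrated_a = integrated_policy(W_INTEGRATED, obs_left, obs_right_a)[0]
left_integrated_b = integrated_policy(W_INTEGRATED, obs_left, obs_right_b)[0]
left_decoupled = decoupled_policy(W_LEFT, obs_left)[0]

print(left_integrated_a, left_integrated_b)  # differ: coupled through the input
print(left_decoupled)                        # does not depend on the right arm
```

In this toy setting the decoupled left-arm model never receives the right arm's observation, so it cannot be disturbed by it. This is the property the abstract associates with uncoordinated tasks; coordinated tasks still need some cross-arm information, which the selective interaction module is meant to supply.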

Paper Structure

This paper contains 20 sections, 7 equations, 8 figures, 10 tables.

Figures (8)

  • Figure 1: Integrated Control vs. Decoupled Interaction. The blue bars represent the success rates on coordinated and uncoordinated tasks for the integrated control baseline, which is built upon our framework without the decoupled interaction design. The green and orange bars represent the success rates on coordinated and uncoordinated tasks for our framework without the interaction design and for our full framework, respectively. Our experiments are conducted on two coordinated tasks and five uncoordinated tasks in the RoboTwin dataset. It can be observed that adding the decoupled design to the integrated control baseline promotes the learning of uncoordinated tasks. Furthermore, incorporating the interaction module on top of this design facilitates the learning of coordinated tasks.
  • Figure 2: Comparisons of our Decoupled Interaction Framework with integrated control frameworks. Integrated control frameworks (a) mainly use a single model that takes the observations and states of both arms as inputs and directly outputs their actions. Our Decoupled Interaction Framework (b) first assigns an independent model to each arm to solely handle the inputs of the current arm (the yellow lines). Then, different from the naive interaction modeling in integrated control frameworks, a selective interaction module is proposed that learns its weights from its own arm to perform explicit modeling (the green and blue lines) on the exchanged state features (the green and blue dashed lines).
  • Figure 3: Architecture of the Decoupled Interaction Framework. Our framework first assigns a separate model to each arm to process its inputs. Then, we exchange state features between the models and utilize a selective interaction module to modulate them. Specifically, we use a selector to predict a scaling factor $\alpha$ and a bias vector $\beta$ to adaptively adjust the exchanged features. Finally, we combine the original visual features, state features and exchanged state features as interactive conditions to predict actions using action generators.
  • Figure 4: Qualitative experiments. In Fig. (a), we visualize the execution process of the "Blocks Stack" task for DP3 and our framework. In Fig. (b), we visualize the execution process of the three-arm experiment for DP3 and our framework. In Fig. (c), we visualize the execution process of DP3 and our framework in the real-world experiment. Dashed circles of different colors highlight common issues that typically arise in integrated control frameworks. "SR" denotes the success rate, which represents the average success rate of the model across all tasks under different experimental settings. Zoom in for a better view.
  • Figure 5: Illustration of our real-world manipulation experimental settings. We use the Cobot Magic as our bimanual robot platform and include everyday objects in our manipulation tasks. A RealSense L515 camera is used to capture 3D point clouds.
  • ...and 3 more figures
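
The selective interaction step described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The caption states only that a selector predicts a scaling factor $\alpha$ and a bias vector $\beta$ to adjust the exchanged state features, and that the visual, state, and exchanged state features are combined as the condition for the action generator. The affine form $\alpha \cdot f + \beta$, the scalar $\alpha$ squashed by a sigmoid, the linear selector, concatenation as the way of combining features, and all names and dimensions below are assumptions made for illustration.

```python
import math
from typing import List, Tuple

Vector = List[float]


def selector(own_state: Vector, w_alpha: Vector, w_beta: List[Vector]) -> Tuple[float, Vector]:
    """Hypothetical selector: predicts alpha and beta from the arm's OWN state features.

    Assumption: alpha is a scalar in (0, 1) via a sigmoid; beta is a linear projection.
    """
    logit = sum(w * s for w, s in zip(w_alpha, own_state))
    alpha = 1.0 / (1.0 + math.exp(-logit))
    beta = [sum(w * s for w, s in zip(row, own_state)) for row in w_beta]
    return alpha, beta


def modulate(exchanged: Vector, alpha: float, beta: Vector) -> Vector:
    """Assumed affine modulation of the other arm's state features: alpha * f + beta."""
    return [alpha * e + b for e, b in zip(exchanged, beta)]


def interactive_condition(visual: Vector, own_state: Vector, modulated: Vector) -> Vector:
    """Assumed combination by concatenation; this would condition the action generator."""
    return visual + own_state + modulated


# Hypothetical toy features for the left arm.
visual_left = [0.1, 0.2, 0.3]
state_left = [0.0, 0.0]
state_right = [2.0, -4.0]  # state features exchanged from the right arm

# Hypothetical selector weights.
w_alpha = [1.0, -1.0]
w_beta = [[0.5, 0.0], [0.0, 0.5]]

alpha, beta = selector(state_left, w_alpha, w_beta)
modulated = modulate(state_right, alpha, beta)
condition = interactive_condition(visual_left, state_left, modulated)

print(alpha, beta)  # 0.5 [0.0, 0.0]
print(modulated)    # [1.0, -2.0]
```

Because $\alpha$ and $\beta$ are computed from the arm's own features, each arm decides for itself how strongly the other arm's state should influence its action. In this sketch, a small $\alpha$ with a near-zero $\beta$ approximates independent behaviour, while a larger $\alpha$ passes more cross-arm information through.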