Learning Coordinated Bimanual Manipulation Policies using State Diffusion and Inverse Dynamics Models

Haonan Chen, Jiaming Xu, Lily Sheng, Tianchen Ji, Shuijing Liu, Yunzhu Li, Katherine Driggs-Campbell

TL;DR

The paper tackles the challenge of coordinating two robotic arms for complex manipulation by introducing a state-prediction diffusion model paired with an inverse dynamics policy. By explicitly predicting future scene states and then computing actions to reach those states, the approach improves long-horizon planning, stability, and multimodal goal handling in bimanual tasks. The method outperforms end-to-end state-to-action baselines in both simulation and real-world experiments, including deformable and multi-object scenarios, and demonstrates robust sim-to-real transfer. This work offers a practical framework for predictive, coordinated manipulation with broader applicability to real-world robotic systems.

Abstract

When performing tasks like laundry, humans naturally coordinate both hands to manipulate objects and anticipate how their actions will change the state of the clothes. However, achieving such coordination in robotics remains challenging due to the need to model object movement, predict future states, and generate precise bimanual actions. In this work, we address these challenges by infusing the predictive nature of human manipulation strategies into robot imitation learning. Specifically, we disentangle task-related state transitions from agent-specific inverse dynamics modeling to enable effective bimanual coordination. Using a demonstration dataset, we train a diffusion model to predict future states given historical observations, envisioning how the scene evolves. Then, we use an inverse dynamics model to compute robot actions that achieve the predicted states. Our key insight is that modeling object movement can help in learning policies for coordinated bimanual manipulation tasks. Evaluating our framework across diverse simulation and real-world manipulation setups, including multimodal goal configurations, bimanual manipulation, deformable objects, and multi-object setups, we find that it consistently outperforms state-of-the-art state-to-action mapping policies. Our method demonstrates a remarkable capacity to navigate multimodal goal configurations and action distributions, maintain stability across different control modes, and synthesize a broader range of behaviors than those present in the demonstration dataset.
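The two-stage structure described in the abstract (predict future states first, then recover the actions that connect them) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `predict_future_states`, `inverse_dynamics`, and `plan_actions` are hypothetical, and both learned models are replaced by toy stand-ins (constant-velocity extrapolation in place of the diffusion model, a state difference in place of the learned inverse dynamics model) so that the control flow runs on its own.

```python
from typing import List, Sequence

State = List[float]   # e.g. flattened keypoint coordinates (assumption)
Action = List[float]  # e.g. an end-effector displacement (assumption)


def predict_future_states(history: Sequence[State], horizon: int) -> List[State]:
    """Toy stand-in for the state diffusion model.

    The paper trains a diffusion model on demonstrations; here we simply
    extrapolate the last observed motion at constant velocity.
    """
    last, prev = history[-1], history[-2]
    velocity = [a - b for a, b in zip(last, prev)]
    return [
        [x + (k + 1) * v for x, v in zip(last, velocity)]
        for k in range(horizon)
    ]


def inverse_dynamics(state: State, next_state: State) -> Action:
    """Toy stand-in for the inverse dynamics model.

    The paper learns this mapping; here the action is the state delta.
    """
    return [b - a for a, b in zip(state, next_state)]


def plan_actions(history: Sequence[State], horizon: int) -> List[Action]:
    """Predict future states, then map each consecutive pair to an action."""
    future = predict_future_states(history, horizon)
    sequence = [list(history[-1])] + future
    return [
        inverse_dynamics(s, s_next)
        for s, s_next in zip(sequence, sequence[1:])
    ]


if __name__ == "__main__":
    observed = [[0.0, 0.0], [1.0, 0.5]]  # two past states of a 2-D point
    for step, action in enumerate(plan_actions(observed, horizon=3), start=1):
        print(f"step {step}: action = {action}")
```

The point of the split is that `predict_future_states` depends only on how the scene evolves, while `inverse_dynamics` is the only part tied to a particular robot, which mirrors the disentanglement of task-related state transitions from agent-specific modeling that the abstract describes.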

Paper Structure

This paper contains 21 sections, 9 equations, 12 figures, 7 tables.

Figures (12)

  • Figure 1: Prediction-aided imitation learning for coordinated bimanual manipulation. In the left image, the L-shaped blocks are represented by keypoints, with their predicted future trajectories visualized. The diffusion model predicts future states, which the inverse dynamics model uses along with the previous state to generate actions. Our framework is validated with underactuated systems, deformable objects, bimanual coordination, and multi-object interactions, demonstrated in tasks such as push-L, laundry cleanup, fruit holding, and cluttered shelf picking.
  • Figure 2: Overview of the proposed framework. At time step $t$, the diffusion model takes as input the latest $T_s$ steps of state data $\mathbf{S}_t$ and outputs the denoised future states. The resulting sequence of states is then sliced and processed by the inverse dynamics model to generate the corresponding action at each feasible time step within the prediction horizon. In the push-L example, the state of the manipulated object is given by a particle-based representation, as shown in the images on the left.
  • Figure 3: Simulation Benchmarks. The XArm robot needs to push two blocks into randomized square positions. The Franka robot needs to manipulate seven objects in a virtual kitchen. The agent needs to push two L-shaped blocks to a target location.
  • Figure 4: Performance comparison in simulation between our method and the improved diffusion policy. The x-axis represents the dataset size, and the y-axis represents the success rate. Our method is more sample-efficient: it achieves better performance at the same dataset size because it uses more supervision signals.
  • Figure 5: Real-world comparison of different models on the Push-L task. In the first row, the improved diffusion model initially pushes the orange block towards the target position, but then over-rotates it, resulting in a failed trial. In the second row, our model pushes the orange block towards the blue block to form a joint rectangle. The agent then successfully pushes the joint rectangle toward the target location.
  • ...and 7 more figures