Spatial-Temporal Graph Diffusion Policy with Kinematic Modeling for Bimanual Robotic Manipulation

Qi Lv, Hao Li, Xiang Deng, Rui Shao, Yinchuan Li, Jianye Hao, Longxiang Gao, Michael Yu Wang, Liqiang Nie

TL;DR

Experimental results show that the proposed Kinematics enhanced Spatial-TemporAl gRaph Diffuser (KStar Diffuser) effectively leverages physical structural information and generates kinematics-aware actions in both simulation and real-world settings.

Abstract

Despite the significant success of imitation learning in robotic manipulation, its application to bimanual tasks remains highly challenging. Existing approaches mainly learn a policy to predict a distant next-best end-effector pose (NBP) and then use inverse kinematics to compute the corresponding joint rotation angles for motion. However, they suffer from two important issues: (1) they rarely consider the physical robot structure, which may cause self-collisions or interference between the arms, and (2) they overlook kinematic constraints, so the predicted poses may not conform to the actual limits of the robot joints. In this paper, we propose the Kinematics enhanced Spatial-TemporAl gRaph Diffuser (KStar Diffuser). Specifically, (1) to incorporate physical robot structure information into action prediction, KStar Diffuser maintains a dynamic spatial-temporal graph built from the physical bimanual joint motions at consecutive timesteps. This dynamic graph serves as the robot-structure condition for denoising the actions; (2) to make the NBP learning objective consistent with kinematics, we introduce differentiable kinematics to provide a reference for optimizing KStar Diffuser. This module regularizes the policy to predict more reliable, kinematics-aware next end-effector poses. Experimental results show that our method effectively leverages physical structural information and generates kinematics-aware actions in both simulation and the real world.
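
To make the spatial-temporal graph idea concrete, here is a minimal sketch of how such a graph could be laid out. It is not the authors' implementation: the function name `build_st_graph`, the node layout `(arm, joint_index, timestep)`, and the optional edge linking the two arm bases are all assumptions made for illustration. Spatial edges follow each arm's joint chain within a timestep, and temporal edges connect the same joint across consecutive timesteps.

```python
from itertools import product


def build_st_graph(num_joints, num_steps, arms=("left", "right"), link_bases=True):
    """Build a toy spatial-temporal joint graph (hypothetical layout).

    A node is a tuple (arm, joint_index, timestep). Edges are undirected and
    stored as frozensets, so each connected pair appears exactly once.
    """
    nodes = [
        (arm, joint, step)
        for arm, joint, step in product(arms, range(num_joints), range(num_steps))
    ]
    edges = set()

    # Spatial edges: consecutive joints along one arm's chain at a fixed timestep.
    for arm, step in product(arms, range(num_steps)):
        for joint in range(num_joints - 1):
            edges.add(frozenset({(arm, joint, step), (arm, joint + 1, step)}))

    # Assumption: the two arm bases are linked so the graph is connected.
    if link_bases and len(arms) == 2:
        for step in range(num_steps):
            edges.add(frozenset({(arms[0], 0, step), (arms[1], 0, step)}))

    # Temporal edges: the same joint at consecutive timesteps.
    for arm, joint in product(arms, range(num_joints)):
        for step in range(num_steps - 1):
            edges.add(frozenset({(arm, joint, step), (arm, joint, step + 1)}))

    return nodes, edges


if __name__ == "__main__":
    nodes, edges = build_st_graph(num_joints=7, num_steps=2)
    print(len(nodes), "nodes,", len(edges), "edges")
```

In a learned policy, each node would carry joint features (for example, joint angles) and a graph encoder would turn the structure into a conditioning signal for the denoiser; that part is omitted here.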

Paper Structure

This paper contains 61 sections, 22 equations, 8 figures, and 8 tables.

Figures (8)

  • Figure 1: The self-collision problem in bimanual manipulation tasks due to overlooking the robotic structure.
  • Figure 2: Overview of KStar Diffuser. The top part presents the spatial-temporal graph, which is constructed according to the robot architecture. The bottom part shows our backbone and the proposed kinematics regularizer. The backbone extracts multimodal information from multiview RGB-D observations and the language instruction, and then generates bimanual 6D end-effector poses. The kinematics regularizer enhances pose learning by incorporating joint-level predictions, which are mapped to reference end-effector poses through differentiable forward kinematics (FK).
  • Figure 3: Left: the simulation environment of the pick_laptop task. Right: the ALOHA device used in the real-world tasks.
  • Figure 4: The visualization of bimanual manipulation on simulated RLBench2 and real-world tasks. The blue annotations represent the motion of the robot's left arm, while the green annotations indicate the motion of the right arm.
  • Figure A: Visualization of the simulated and real-world tasks. Tasks marked "(R)" are real-world tasks.
  • ...and 3 more figures
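
The kinematics regularizer described in the Figure 2 caption maps joint-level predictions to reference end-effector poses through forward kinematics. The sketch below illustrates that idea on a toy planar arm. It is a simplification under stated assumptions, not the paper's method: the planar geometry, the names `planar_fk` and `kinematics_consistency`, the squared-distance penalty, and the clamping of joints to their limits are all hypothetical. A real implementation would use the robot's full 6D kinematic model inside an autodiff framework so that gradients flow through FK.

```python
import math


def planar_fk(joint_angles, link_lengths):
    """Forward kinematics of a planar serial arm; returns (x, y, heading)."""
    x = y = heading = 0.0
    for angle, length in zip(joint_angles, link_lengths):
        heading += angle                  # joint angles accumulate along the chain
        x += length * math.cos(heading)
        y += length * math.sin(heading)
    return x, y, heading


def kinematics_consistency(pred_xy, pred_joints, link_lengths, joint_limits):
    """Squared distance between a predicted position and its FK reference.

    Assumption: predicted joints are clamped to their limits first, so the
    reference position is always reachable by the arm.
    """
    clamped = [
        min(max(q, low), high) for q, (low, high) in zip(pred_joints, joint_limits)
    ]
    ref_x, ref_y, _ = planar_fk(clamped, link_lengths)
    return (pred_xy[0] - ref_x) ** 2 + (pred_xy[1] - ref_y) ** 2


if __name__ == "__main__":
    limits = [(-1.0, 1.0), (-1.0, 1.0)]
    # A predicted position that agrees with the joint prediction incurs no penalty.
    print(kinematics_consistency((2.0, 0.0), [0.0, 0.0], [1.0, 1.0], limits))
    # A predicted position that the joints do not support is penalized.
    print(kinematics_consistency((0.0, 0.0), [0.0, 0.0], [1.0, 1.0], limits))
```

The penalty is zero only when the predicted end-effector position matches what the predicted joints can actually produce, which is the sense in which the FK output acts as a reference for the pose prediction.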