Hierarchical Diffusion Policy: manipulation trajectory generation via contact guidance

Dexin Wang, Chunsheng Liu, Faliang Chang, Yichen Xu

TL;DR

The paper proposes three key technical contributions, namely snapshot gradient optimization, 3D conditioning, and prompt guidance, which improve the policy's optimization efficiency, spatial awareness, and controllability, respectively.

Abstract

Decision-making in robotics using denoising diffusion processes has become an increasingly active research topic, but end-to-end policies perform poorly in contact-rich tasks and have limited controllability. This paper proposes Hierarchical Diffusion Policy (HDP), a new imitation learning method that uses objective contacts to guide the generation of robot trajectories. The policy is divided into two layers: the high-level policy predicts the contact for the robot's next object manipulation based on 3D information, while the low-level policy predicts the action sequence toward the high-level contact based on the latent variables of observation and contact. We represent the policies at both levels as conditional denoising diffusion processes, and combine behavioral cloning and Q-learning to optimize the low-level policy so that it accurately guides actions toward the contact. We benchmark Hierarchical Diffusion Policy across 6 different tasks and find that it significantly outperforms the existing state-of-the-art imitation learning method, Diffusion Policy, with an average improvement of 20.8%. We find that contact guidance yields significant improvements, including superior performance, greater interpretability, and stronger controllability, especially on contact-rich tasks. To further unlock the potential of HDP, this paper proposes a set of key technical contributions, including snapshot gradient optimization, 3D conditioning, and prompt guidance, which improve the policy's optimization efficiency, spatial awareness, and controllability, respectively. Finally, real-world experiments verify that HDP can handle both rigid and deformable objects.

Paper Structure

This paper contains 29 sections, 21 equations, 16 figures, 9 tables, 4 algorithms.

Figures (16)

  • Figure 1: Inference Process of Hierarchical Diffusion Policy. Compared to the Diffusion Policy that generates operation trajectories end-to-end, HDP introduces objective contacts, predicted by the Guider network or provided by humans, to guide the trajectory generation.
  • Figure 2: Hierarchical Diffusion Policy Overview. (a) At time step $t$ during inference, the Guider takes the latest $T_o$ steps of observation data $\mathbf{O_t}$ as input and predicts the objective contact $\mathbf{C_t}$; the Actor takes the observation data $\mathbf{O_t}$ and the objective contact $\mathbf{C_t}$ as input and predicts $T_p$ steps of actions, of which $T_a$ steps are executed on the robot without re-planning. During training, in addition to minimizing the prediction error of the Actor, Guider, and Critic with respect to the ground truth, the Actor's weights are also optimized by maximizing the Q-values. (b) The Actor and Guider are modeled as conditional denoising diffusion models, with their networks built on one-dimensional convolutions and linear layers, respectively. The Actor's architecture is similar to that of the Diffusion Policy. The Critic is built from linear layers.
  • Figure 3: Phased objective contacts. The algorithm consists of three steps: recording object subgoals and contacts, merging misoperations and no-contact operations, and configuring objective contacts. Misoperations include ① overshooting the movement and ② not moving or barely moving the object. The algorithm eliminates contacts related to misoperations by detecting object pose similarity, reducing suboptimal objective contacts (last row). This process also removes the contacts related to operations near subgoals, resulting in an early update of the objective contact (the end of the last row), which is experimentally shown not to impair performance (Fig. \ref{fig_result_oc}).
  • Figure 4: Q-learning Ablation Study. HDP with one-shot denoising performs better and is more robust than HDP with iterative denoising. A larger coefficient leads to greater optimization intensity and more optimization conflicts. Both the sensitivity of HDP to coefficient changes and the gradient imbalance caused by iterative denoising increase as the number of training samples decreases.
  • Figure 5: Multimodal behavior. In the initial state, the robot can pull the left side or push the right side to move the object to the target position (black wireframe). The warm and cold curves represent the motion trajectories of the left and right fingers of the gripper, respectively. The two end points of the black dashed line are the objective contacts predicted by the Hierarchical Diffusion Policy. Hierarchical Diffusion Policy learns both modes and executes exactly one of them in each deployment. Diffusion Policy is biased toward one mode and rarely succeeds. Trajectories are collected from 10 executions.
  • ...and 11 more figures
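
The inference scheme in the Figure 2 caption (the Guider predicts an objective contact from the latest $T_o$ observations, the Actor predicts $T_p$ actions toward it, and $T_a$ of them are executed before re-planning) can be illustrated with a minimal receding-horizon sketch. This is not the authors' implementation: `run_episode`, `ToyEnv`, `toy_guider`, and `toy_actor` are hypothetical names, the Guider and Actor are replaced by simple hand-written functions rather than diffusion models, and the default values for `T_o`, `T_p`, and `T_a` are illustrative assumptions.

```python
from collections import deque
from typing import Callable, List, Sequence

Observation = List[float]
Contact = List[float]
Action = List[float]


def run_episode(
    env_step: Callable[[Action], Observation],
    guider: Callable[[Sequence[Observation]], Contact],
    actor: Callable[[Sequence[Observation], Contact], List[Action]],
    first_obs: Observation,
    T_o: int = 2,    # observation horizon (illustrative value)
    T_p: int = 16,   # prediction horizon (illustrative value)
    T_a: int = 8,    # execution horizon (illustrative value)
    max_steps: int = 32,
) -> List[Action]:
    """Plan T_p actions, execute the first T_a, then re-plan."""
    # Pad the observation window with the first observation.
    history = deque([first_obs] * T_o, maxlen=T_o)
    executed: List[Action] = []
    while len(executed) < max_steps:
        obs_window = list(history)
        contact = guider(obs_window)        # high level: objective contact C_t
        plan = actor(obs_window, contact)   # low level: T_p actions toward C_t
        assert len(plan) == T_p
        for action in plan[:T_a]:           # executed without re-planning
            history.append(env_step(action))
            executed.append(action)
            if len(executed) >= max_steps:
                break
    return executed


class ToyEnv:
    """Hypothetical 1-D environment: the state is a single position."""

    def __init__(self) -> None:
        self.pos = 0.0

    def step(self, action: Action) -> Observation:
        self.pos += action[0]
        return [self.pos]


def toy_guider(obs_window: Sequence[Observation]) -> Contact:
    # Stand-in for the Guider: always proposes the contact at position 1.0.
    return [1.0]


def toy_actor(obs_window: Sequence[Observation], contact: Contact) -> List[Action]:
    # Stand-in for the Actor: 16 equal steps covering the gap to the contact.
    gap = contact[0] - obs_window[-1][0]
    return [[gap / 16] for _ in range(16)]


if __name__ == "__main__":
    env = ToyEnv()
    actions = run_episode(env.step, toy_guider, toy_actor, first_obs=[0.0])
    print(len(actions), env.pos)
```

In this toy setting only half of each plan is executed before re-planning, so the position closes half of the remaining gap to the contact on every planning cycle. A human-provided contact, as mentioned in the Figure 1 caption, would correspond to swapping `toy_guider` for a function that returns the user's chosen contact.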