Hierarchical Diffusion Policy for Kinematics-Aware Multi-Task Robotic Manipulation

Xiao Ma; Sumit Patidar; Iain Haughton; Stephen James

TL;DR

Hierarchical Diffusion Policy (HDP) tackles long-horizon, multi-task robotic manipulation by factorising control into a high-level next-best end-effector pose predictor and a low-level, kinematics-aware diffusion controller, RK-Diffuser. The high-level policy uses language-conditioned PerAct with 3D voxel scene representations to propose a 6-DoF pose and gripper target, while RK-Diffuser generates a trajectory in joint space through dual diffusion models for pose and joints, augmented by differentiable kinematics to distill pose accuracy into feasible joint motions. Key innovations include keyframe-driven training, end-to-end differentiable kinematics for joint-space inpainting, and trajectory ranking to bias toward efficient paths, all enabling robust multi-task performance in RLBench and successful real-world trials with limited demonstrations. The approach achieves state-of-the-art results on challenging tasks and demonstrates practical viability for real robots, highlighting the potential of combining hierarchical diffusion with kinematics-aware control for scalable, context-aware manipulation.
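
The inpainting idea mentioned above can be illustrated with a toy sketch. It assumes the common trajectory-diffusion recipe of overwriting known waypoints after every reverse step. The denoiser here is a hand-written smoothing stand-in, not the learned RK-Diffuser network, and the function names, horizon, step count, and noise schedule are all hypothetical choices made for illustration. This summary does not state which waypoints HDP clamps, so the start and goal are clamped here purely as an example.

```python
import numpy as np

def toy_denoise_step(traj, rng, noise_scale):
    """Stand-in for a learned denoiser (an assumption, not the paper's model):
    pull interior waypoints toward their neighbours, then add scaled noise."""
    smoothed = traj.copy()
    smoothed[1:-1] = 0.5 * traj[1:-1] + 0.25 * (traj[:-2] + traj[2:])
    return smoothed + noise_scale * rng.standard_normal(traj.shape)

def inpaint(traj, start, goal):
    """Overwrite the known endpoints so the sample stays anchored to them."""
    traj = traj.copy()
    traj[0] = start
    traj[-1] = goal
    return traj

def sample_trajectory(start, goal, horizon=16, num_steps=50, seed=0):
    """Reverse-diffusion-style loop with endpoint inpainting after each step."""
    rng = np.random.default_rng(seed)
    traj = rng.standard_normal((horizon, start.shape[0]))  # start from noise
    traj = inpaint(traj, start, goal)
    for k in reversed(range(num_steps)):
        noise_scale = 0.1 * k / num_steps  # injected noise decays to zero
        traj = toy_denoise_step(traj, rng, noise_scale)
        traj = inpaint(traj, start, goal)  # re-impose the known waypoints
    return traj

start = np.array([0.0, 0.0, 0.0])
goal = np.array([0.5, 0.2, 0.3])
trajectory = sample_trajectory(start, goal)
```

Because the endpoints are re-imposed after every step, the final sample begins and ends exactly at the requested waypoints regardless of what the denoiser does in between.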

Abstract

This paper introduces Hierarchical Diffusion Policy (HDP), a hierarchical agent for multi-task robotic manipulation. HDP factorises a manipulation policy into a hierarchical structure: a high-level task-planning agent which predicts a distant next-best end-effector pose (NBP), and a low-level goal-conditioned diffusion policy which generates optimal motion trajectories. The factorised policy representation allows HDP to tackle long-horizon task planning while also generating fine-grained low-level actions. To generate context-aware motion trajectories while satisfying robot kinematics constraints, we present a novel kinematics-aware goal-conditioned control agent, Robot Kinematics Diffuser (RK-Diffuser). Specifically, RK-Diffuser learns to generate both the end-effector pose and joint position trajectories, and distills the accurate but kinematics-unaware end-effector pose diffuser into the kinematics-aware but less accurate joint position diffuser via differentiable kinematics. Empirically, we show that HDP achieves a significantly higher success rate than state-of-the-art methods in both simulation and the real world.
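
To make the role of differentiable kinematics concrete, here is a minimal sketch on a hypothetical 2-link planar arm. It takes an accurate end-effector trajectory and an inaccurate joint trajectory, then refines the joints by gradient descent through the forward-kinematics function until the two agree. This per-trajectory refinement is a simplification chosen for illustration; the abstract describes distilling one diffuser into the other, and the arm model, link lengths, step size, and function names below are all assumptions.

```python
import numpy as np

LINK_LENGTHS = (1.0, 0.8)  # hypothetical link lengths of a 2-link planar arm

def forward_kinematics(q):
    """Map joint angles q of shape (T, 2) to end-effector xy of shape (T, 2)."""
    l1, l2 = LINK_LENGTHS
    x = l1 * np.cos(q[..., 0]) + l2 * np.cos(q[..., 0] + q[..., 1])
    y = l1 * np.sin(q[..., 0]) + l2 * np.sin(q[..., 0] + q[..., 1])
    return np.stack([x, y], axis=-1)

def jacobian(q):
    """Analytic Jacobian d(xy)/dq with shape (T, 2, 2), indexed [t, xy, joint]."""
    l1, l2 = LINK_LENGTHS
    s1, c1 = np.sin(q[..., 0]), np.cos(q[..., 0])
    s12, c12 = np.sin(q[..., 0] + q[..., 1]), np.cos(q[..., 0] + q[..., 1])
    row_x = np.stack([-l1 * s1 - l2 * s12, -l2 * s12], axis=-1)
    row_y = np.stack([l1 * c1 + l2 * c12, l2 * c12], axis=-1)
    return np.stack([row_x, row_y], axis=-2)

def refine_joints_toward_poses(q_init, pose_target, steps=500, lr=0.3):
    """Gradient descent on 0.5 * ||FK(q) - pose_target||^2 for each waypoint."""
    q = q_init.copy()
    for _ in range(steps):
        err = forward_kinematics(q) - pose_target          # (T, 2)
        grad = np.einsum("tij,ti->tj", jacobian(q), err)   # J^T err per waypoint
        q -= lr * grad
    return q

T = 20
q_true = np.stack([np.linspace(0.2, 1.0, T), np.linspace(0.7, 1.3, T)], axis=-1)
pose_traj = forward_kinematics(q_true)  # stands in for the accurate pose trajectory
q_noisy = q_true + 0.1                  # stands in for the less accurate joint trajectory
q_refined = refine_joints_toward_poses(q_noisy, pose_traj)

err_before = np.abs(forward_kinematics(q_noisy) - pose_traj).max()
err_after = np.abs(forward_kinematics(q_refined) - pose_traj).max()
```

The output stays in joint space, so it can be sent to a joint-position controller directly, while the pose trajectory only supplies the error signal.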

Paper Structure (26 sections, 11 equations, 6 figures, 3 tables)

Introduction
Related Works
End-to-End Visual Manipulation Agents
Diffusion Models
Differentiable Physics for Decision Making
Preliminaries
Diffusion Models
Differentiable Kinematics
Hierarchical Diffusion Policy
Dataset Preparation
High-Level Next-Best Pose Agent
Low-Level RK-Diffuser
Practical Implementation Choices
Experiments
Trajectory Visualisations
...and 11 more sections

Figures (6)

Figure 1: We introduce HDP, a hierarchical agent for robotic manipulation. At the high-level, HDP learns to predict the next-best end-effector pose. Conditioned on the current and the predicted pose (red), a diffusion model generates an action trajectory for the robot to follow (blue). In contrast, the trajectories generated by classic planners (yellow) cannot be executed due to violating environment constraints, e.g., the hinge of the box.
Figure 2: We focus on learning a multi-task language-guided agent for robotic manipulation. Unlike a standard motion planner that only samples an arbitrary trajectory to the end pose.
Figure 3: Overview of Hierarchical Diffusion Policy (HDP). HDP is a multi-task hierarchical agent for kinematics-aware robotic manipulation. HDP consists of two levels: a high-level language-guided agent and a low-level goal-conditioned diffusion policy. From left to right, the high-level agent takes in 3D environment observations and language instructions, then predicts the next-best end-effector pose. This pose guides the low-level RK-Diffuser. The RK-Diffuser subsequently generates a continuous joint-position trajectory by conditional sampling and trajectory inpainting given the next-best pose and environment observations. To generate kinematics-aware trajectories, RK-Diffuser distills the accurate but less flexible end-effector pose trajectories into joint position space via differentiable robot kinematics.
Figure 4: Trajectory visualisations of the open box task.
Figure 5: Real-robot execution sequences. For both tasks, the robot must predict accurate trajectories that reflect the task context given by the language instruction. Because the appliances have high resistance forces, a slight deviation from the expected trajectory causes the robot to exceed its joint torque limit and fail.
...and 1 more figure
