Hierarchical Diffusion Policy for Kinematics-Aware Multi-Task Robotic Manipulation
Xiao Ma, Sumit Patidar, Iain Haughton, Stephen James
TL;DR
Hierarchical Diffusion Policy (HDP) tackles long-horizon, multi-task robotic manipulation by factorising control into a high-level next-best end-effector pose predictor and a low-level, kinematics-aware diffusion controller, RK-Diffuser. The high-level policy uses language-conditioned PerAct with 3D voxel scene representations to propose a 6-DoF pose and gripper target. RK-Diffuser then generates a joint-space trajectory using two diffusion models, one over end-effector poses and one over joint positions, and uses differentiable kinematics to distill the pose model's accuracy into kinematically feasible joint motions. Key innovations include keyframe-driven training, end-to-end differentiable kinematics for joint-space inpainting, and trajectory ranking to bias toward efficient paths. Together these enable robust multi-task performance in RLBench and successful real-world trials with limited demonstrations. The approach achieves state-of-the-art results on challenging tasks and demonstrates that combining hierarchical diffusion with kinematics-aware control is practical on real robots.
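The trajectory-ranking idea can be illustrated with a minimal sketch. It assumes that "efficient" means a short path in joint space, which is one plausible criterion and not necessarily the one HDP uses. The function names and the three 2-joint example trajectories are hypothetical.

```python
import math

def path_length(trajectory):
    """Sum of Euclidean distances between consecutive joint configurations."""
    return sum(math.dist(a, b) for a, b in zip(trajectory, trajectory[1:]))

def rank_trajectories(candidates):
    """Return the candidates sorted from shortest to longest joint-space path."""
    return sorted(candidates, key=path_length)

# Three hypothetical 2-joint trajectories sharing the same start and goal.
direct = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
detour = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
wobble = [(0.0, 0.0), (0.8, -0.4), (0.2, 0.9), (1.0, 1.0)]

ranked = rank_trajectories([detour, wobble, direct])
best = ranked[0]  # the candidate with the shortest joint-space path
```

Under this assumption, keeping only the top-ranked candidate for each start and goal pair would bias the training data toward shorter motions.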
Abstract
This paper introduces Hierarchical Diffusion Policy (HDP), a hierarchical agent for multi-task robotic manipulation. HDP factorises a manipulation policy into two levels: a high-level task-planning agent, which predicts a distant next-best end-effector pose (NBP), and a low-level goal-conditioned diffusion policy, which generates optimal motion trajectories. This factorised representation allows HDP to handle long-horizon task planning while still generating fine-grained low-level actions. To generate context-aware motion trajectories that satisfy robot kinematics constraints, we present a novel kinematics-aware goal-conditioned control agent, Robot Kinematics Diffuser (RK-Diffuser). Specifically, RK-Diffuser learns to generate both end-effector pose and joint position trajectories, and distills the accurate but kinematics-unaware end-effector pose diffuser into the kinematics-aware but less accurate joint position diffuser via differentiable kinematics. Empirically, we show that HDP achieves a significantly higher success rate than state-of-the-art methods in both simulation and the real world.
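The role of differentiable kinematics can be illustrated with a toy sketch. It is not the RK-Diffuser implementation: it contains no diffusion model, the arm is a hypothetical planar 2-link arm with unit link lengths, and plain gradient descent with a hand-written Jacobian stands in for backpropagation through a differentiable kinematics layer. The idea it shows is that an inaccurate joint trajectory can be pulled toward an accurate end-effector trajectory by minimising the forward-kinematics error.

```python
import math

LINK_1, LINK_2 = 1.0, 1.0  # link lengths of the hypothetical planar arm

def forward_kinematics(q):
    """Map joint angles (q1, q2) to the end-effector position (x, y)."""
    q1, q2 = q
    x = LINK_1 * math.cos(q1) + LINK_2 * math.cos(q1 + q2)
    y = LINK_1 * math.sin(q1) + LINK_2 * math.sin(q1 + q2)
    return x, y

def jacobian(q):
    """Analytic Jacobian d(x, y) / d(q1, q2) of forward_kinematics."""
    q1, q2 = q
    s1, c1 = math.sin(q1), math.cos(q1)
    s12, c12 = math.sin(q1 + q2), math.cos(q1 + q2)
    return ((-LINK_1 * s1 - LINK_2 * s12, -LINK_2 * s12),
            (LINK_1 * c1 + LINK_2 * c12, LINK_2 * c12))

def pose_error(joint_traj, pose_traj):
    """Mean squared distance between FK(joint_traj) and the target positions."""
    total = 0.0
    for q, (tx, ty) in zip(joint_traj, pose_traj):
        x, y = forward_kinematics(q)
        total += (x - tx) ** 2 + (y - ty) ** 2
    return total / len(joint_traj)

def refine(joint_traj, pose_traj, lr=0.2, steps=1000):
    """Gradient descent on joint angles so that FK matches the pose targets."""
    traj = [list(q) for q in joint_traj]
    for _ in range(steps):
        for q, (tx, ty) in zip(traj, pose_traj):
            x, y = forward_kinematics(q)
            ex, ey = x - tx, y - ty
            (j11, j12), (j21, j22) = jacobian(q)
            # The gradient of 0.5 * ||FK(q) - target||^2 is J^T e.
            q[0] -= lr * (j11 * ex + j21 * ey)
            q[1] -= lr * (j12 * ex + j22 * ey)
    return [tuple(q) for q in traj]

# Hypothetical data: an accurate pose trajectory and a perturbed joint trajectory.
true_joints = [(0.2, 0.9), (0.4, 1.0), (0.6, 1.1)]
targets = [forward_kinematics(q) for q in true_joints]
noisy = [(q1 + 0.15, q2 - 0.2) for q1, q2 in true_joints]

refined = refine(noisy, targets)
print(pose_error(noisy, targets), pose_error(refined, targets))
```

Because the correction is applied to joint angles and not to Cartesian poses, every waypoint of the refined trajectory remains a configuration the arm can actually take, which is the property the joint-space formulation is meant to preserve.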
