PlanDQ: Hierarchical Plan Orchestration via D-Conductor and Q-Performer
Chang Chen, Junyeob Baek, Fei Deng, Kenji Kawaguchi, Caglar Gulcehre, Sungjin Ahn
TL;DR
PlanDQ addresses the gap between long-horizon, sparse-reward and short-horizon, dense-reward offline RL with a hierarchical approach that combines a diffusion-based high-level planner (D-Conductor) with a diffusion-based low-level policy (Q-Performer) optimized via Q-learning objectives. The method generates sub-goals at the high level and solves the resulting sub-tasks with a goal-conditioned diffusion policy, aided by a TD-based value update and intrinsic rewards to improve credit assignment. Across D4RL benchmarks and long-horizon tasks (AntMaze, Kitchen, Calvin), PlanDQ delivers competitive or superior performance. Ablations indicate that the value-learning component is crucial in noisy, short-horizon settings and that a Q-learning-based high-level conductor may underperform in dense-reward settings. The work highlights the complementary strengths of diffusion planning and value-based learning for offline RL, offers practical gains on complex, hierarchical tasks, and points to future work on adaptive sub-goal scheduling and efficiency improvements.
Abstract
Despite recent advances in offline RL, no unified algorithm has achieved superior performance across a broad range of tasks. Offline value function learning, in particular, struggles with sparse-reward, long-horizon tasks because credit assignment is difficult and extrapolation errors accumulate as the task horizon grows. On the other hand, models that perform well on long-horizon tasks are designed specifically for goal-conditioned tasks and commonly perform worse than value function learning methods in short-horizon, dense-reward scenarios. To bridge this gap, we propose PlanDQ, a hierarchical planner designed for offline RL. PlanDQ incorporates a diffusion-based planner at the high level, named D-Conductor, which guides the low-level policy through sub-goals. At the low level, we use a Q-learning-based approach, called Q-Performer, to accomplish these sub-goals. Our experimental results suggest that PlanDQ achieves superior or competitive performance on the D4RL continuous control benchmark tasks as well as on the long-horizon tasks AntMaze, Kitchen, and Calvin.
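
The two-level structure described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`plan_subgoals`, `low_level_action`, `rollout`), the one-dimensional state, and the toy dynamics are all assumptions made for illustration. Evenly spaced waypoints stand in for the diffusion-based D-Conductor, and a bounded step toward the current sub-goal stands in for the learned Q-Performer policy.

```python
from typing import List


def plan_subgoals(state: float, goal: float, num_subgoals: int) -> List[float]:
    """Hypothetical stand-in for the high-level planner.

    Returns evenly spaced waypoints from the current state to the goal.
    The real D-Conductor would sample sub-goals from a diffusion model.
    """
    return [
        state + (goal - state) * (i + 1) / num_subgoals
        for i in range(num_subgoals)
    ]


def low_level_action(state: float, subgoal: float, max_step: float = 0.5) -> float:
    """Hypothetical stand-in for the low-level, goal-conditioned policy.

    Takes a bounded step toward the sub-goal. The real Q-Performer would be
    a learned policy trained with Q-learning objectives.
    """
    delta = subgoal - state
    return max(-max_step, min(max_step, delta))


def rollout(
    start: float,
    goal: float,
    num_subgoals: int = 4,
    steps_per_subgoal: int = 3,
) -> List[float]:
    """Run the two-level loop: plan sub-goals once, then pursue each in turn."""
    state = start
    visited = [state]
    for subgoal in plan_subgoals(state, goal, num_subgoals):
        for _ in range(steps_per_subgoal):
            # Toy dynamics (assumption): the action is added to the state.
            state = state + low_level_action(state, subgoal)
            visited.append(state)
    return visited


if __name__ == "__main__":
    trajectory = rollout(start=0.0, goal=4.0)
    print(trajectory)
```

The point of the sketch is only the division of labor: the high level decides where to go over a long horizon, and the low level solves a sequence of short-horizon sub-tasks, each conditioned on the current sub-goal.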
