PlanDQ: Hierarchical Plan Orchestration via D-Conductor and Q-Performer

Chang Chen; Junyeob Baek; Fei Deng; Kenji Kawaguchi; Caglar Gulcehre; Sungjin Ahn

TL;DR

PlanDQ addresses the gap in offline RL between methods suited to long-horizon, sparse-reward tasks and those suited to short-horizon, dense-reward tasks. It takes a hierarchical approach that combines a diffusion-based high-level planner (D-Conductor) with a diffusion-based low-level policy (Q-Performer) optimized via Q-learning objectives. The method generates sub-goals at the high level and solves the resulting sub-tasks with a goal-conditioned diffusion policy, aided by a TD-based value update and intrinsic rewards to improve credit assignment. Across D4RL benchmarks and long-horizon tasks (AntMaze, Kitchen, Calvin), PlanDQ delivers competitive or superior performance. Ablations indicate that the value-learning component is crucial in noisy, short-horizon settings and that high-level Q-conductors may underperform in dense-reward settings. The work highlights the complementary strengths of diffusion planning and value-based learning for offline RL, offers practical gains on complex, hierarchical tasks, and points to future work on adaptive sub-goal scheduling and efficiency improvements.

Abstract

Despite recent advances in offline RL, no unified algorithm has achieved superior performance across a broad range of tasks. Offline value function learning, in particular, struggles with sparse-reward, long-horizon tasks because credit assignment is difficult and extrapolation errors accumulate as the task horizon grows. On the other hand, models that perform well on long-horizon tasks are designed specifically for goal-conditioned tasks and commonly perform worse than value function learning methods in short-horizon, dense-reward scenarios. To bridge this gap, we propose a hierarchical planner designed for offline RL called PlanDQ. PlanDQ incorporates a diffusion-based planner at the high level, named D-Conductor, which guides the low-level policy through sub-goals. At the low level, we use a Q-learning-based approach called the Q-Performer to accomplish these sub-goals. Our experimental results suggest that PlanDQ can achieve superior or competitive performance on D4RL continuous control benchmark tasks as well as on the long-horizon tasks AntMaze, Kitchen, and Calvin.
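
The two-level control flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the fixed re-planning interval, and the toy 1-D planner, policy, and dynamics are all assumptions introduced here for illustration. In PlanDQ the high-level role would be played by a diffusion planner and the low-level role by a Q-learning-based policy.

```python
from typing import Callable, List, Tuple


def run_hierarchical_episode(
    state: float,
    plan_subgoal: Callable[[float], float],   # hypothetical high-level planner
    act: Callable[[float, float], float],     # hypothetical goal-conditioned policy
    step: Callable[[float, float], float],    # hypothetical environment dynamics
    horizon: int,
    subgoal_interval: int,
) -> Tuple[List[float], List[float]]:
    """Roll out a two-level controller and return (visited states, sub-goals)."""
    states = [state]
    subgoals = []
    subgoal = state
    for t in range(horizon):
        # Assumption: the high level proposes a new sub-goal every
        # `subgoal_interval` steps; the source does not fix this schedule.
        if t % subgoal_interval == 0:
            subgoal = plan_subgoal(state)
            subgoals.append(subgoal)
        # The low level acts conditioned on the current sub-goal.
        action = act(state, subgoal)
        state = step(state, action)
        states.append(state)
    return states, subgoals


# Toy 1-D stand-ins (purely illustrative, not learned models).
def toy_planner(state: float) -> float:
    return state + 2.0  # propose a sub-goal two units ahead


def toy_policy(state: float, subgoal: float) -> float:
    return max(-1.0, min(1.0, subgoal - state))  # move at most one unit per step


def toy_step(state: float, action: float) -> float:
    return state + action


states, subgoals = run_hierarchical_episode(
    0.0, toy_planner, toy_policy, toy_step, horizon=6, subgoal_interval=2
)
```

In this toy setting the low level reaches each sub-goal exactly when the next one is proposed; with learned components the schedule and the sub-goal representation would be design choices.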

Paper Structure (24 sections, 14 equations, 10 figures, 8 tables, 3 algorithms)

Introduction
Preliminaries
Offline Reinforcement Learning
Diffusion Probabilistic Models
Diffusion Models for Reinforcement Learning
Method
D-Conductor
Q-Performer
Orchestrating Family
Experiment
Experimental Setup
Long-horizon Navigation and Manipulation
Short-horizon Controlling
Analysis and Ablation Study
Related Works
...and 9 more sections

Figures (10)

Figure 1: Overview. (a) To explore the optimal hierarchical planning architecture, we examine four hierarchical modules: two conductors as high-level components and two performers as low-level components. We introduce a novel hierarchical planning architecture, PlanDQ, which uses the D-Conductor as the high-level component and the Q-Performer as the low-level component. (b) The planning process of PlanDQ.
Figure 2: Value Estimation Comparison. Compared with Diffusion-QL, Diffuser learns a noisier value function.
Figure 3: Model Exploration on Gym-MuJoCo. PlanDQ achieves the best average performance among all model variants.
Figure 4: PlanDQ with different low-level reward schemes on Gym-MuJoCo. Combining external rewards with intrinsic rewards achieves the best average performance across dataset qualities.
Figure 5: Benchmark Performance. The graph compares two main baselines, DQL and HD, with our proposed approach, PlanDQ. The evaluation covers six environments: three long-horizon tasks (AntMaze, Kitchen, and Calvin) and three short-horizon tasks (D4RL-Locomotion-Medium-Replay, Medium, and Medium-Expert). Scores are normalized to the maximum performance on each benchmark.
...and 5 more figures

Theorems & Definitions (1)

Example 1
