Accelerating Diffusion Planners in Offline RL via Reward-Aware Consistency Trajectory Distillation
Xintong Duan, Yutong He, Fahim Tajwar, Ruslan Salakhutdinov, J. Zico Kolter, Jeff Schneider
TL;DR
The paper addresses the slow inference of diffusion-based planners in offline RL by introducing Reward-Aware Consistency Trajectory Distillation (RACTD), which distills a pre-trained diffusion teacher into a single-step student while a separately trained reward model supplies guidance on the student's noise-free outputs. Because the reward objective is added directly to the consistency trajectory distillation losses and training is decoupled, RACTD avoids the concurrent multi-network training that actor-critic approaches require. The paper reports competitive or superior performance relative to existing methods, with a 9.7% improvement over the previous state of the art and up to 142x faster inference than its diffusion counterparts. The gains appear on multi-modal, suboptimal offline datasets (Gym-MuJoCo, FrankaKitchen) and on long-horizon planning tasks (Maze2d), where the reward objective steers the student toward high-reward modes.
Abstract
Although diffusion models have achieved strong results in decision-making tasks, their slow inference speed remains a key limitation. While consistency models offer a potential solution, existing applications to decision-making either struggle with suboptimal demonstrations under behavior cloning or rely on complex concurrent training of multiple networks under the actor-critic framework. In this work, we propose a novel approach to consistency distillation for offline reinforcement learning that directly incorporates reward optimization into the distillation process. Our method achieves single-step sampling while generating higher-reward action trajectories through decoupled training and noise-free reward signals. Empirical evaluations on the Gym-MuJoCo, FrankaKitchen, and long-horizon planning benchmarks demonstrate that our approach can achieve a 9.7% improvement over the previous state of the art while offering up to 142x speedup over diffusion counterparts in inference time.
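The idea of folding reward optimization into the distillation objective can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the plain mean-squared distillation term, the additive weighting via `reward_weight`, and the hand-written `toy_reward` are all assumptions made for illustration. The sketch only shows that a reward term evaluated on the student's clean one-step output can shift the preferred output away from pure teacher matching.

```python
import numpy as np

def distillation_loss(student_traj, teacher_target):
    # Stand-in for the consistency distillation term (assumed here to be plain MSE).
    return float(np.mean((student_traj - teacher_target) ** 2))

def reward_loss(student_traj, reward_model):
    # Negative predicted reward of the student's clean (noise-free) one-step output.
    return -float(np.mean(reward_model(student_traj)))

def reward_aware_loss(student_traj, teacher_target, reward_model, reward_weight=1.0):
    # Hypothetical additive combination; reward_weight is an assumed hyperparameter.
    return (distillation_loss(student_traj, teacher_target)
            + reward_weight * reward_loss(student_traj, reward_model))

def toy_reward(traj):
    # Hypothetical frozen reward model: highest reward when every action equals 1.
    return -np.sum((traj - 1.0) ** 2, axis=-1)

# Toy batch: 4 trajectories with 3 action dimensions each.
teacher_target = np.zeros((4, 3))   # what the teacher would produce
near_teacher = np.zeros((4, 3))     # student output that copies the teacher
high_reward = np.ones((4, 3))       # student output that maximizes toy_reward

for name, traj in [("near_teacher", near_teacher), ("high_reward", high_reward)]:
    print(name, reward_aware_loss(traj, teacher_target, toy_reward, reward_weight=1.0))
```

With `reward_weight=0.0` the objective reduces to teacher matching alone; raising the weight trades fidelity to the teacher for higher predicted reward, which is the behavior the abstract attributes to reward-aware distillation on suboptimal data.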
