Accelerating Diffusion Planners in Offline RL via Reward-Aware Consistency Trajectory Distillation

Xintong Duan, Yutong He, Fahim Tajwar, Ruslan Salakhutdinov, J. Zico Kolter, Jeff Schneider

TL;DR

The paper addresses the slow inference of diffusion-based planners in offline RL by introducing Reward-Aware Consistency Trajectory Distillation (RACTD), which uses a pre-trained diffusion teacher and a separately trained reward model to distill a single-step student guided by noise-free reward signals. By adding a reward objective to the consistency trajectory distillation losses and training the components in a decoupled fashion, RACTD achieves competitive or superior performance with substantial inference-time speedups compared to existing diffusion-based and actor-critic methods. The method shows strong gains on multi-modal, suboptimal offline datasets (Gym MuJoCo, FrankaKitchen) and on long-horizon planning tasks (Maze2d), while keeping a simpler training pipeline. Overall, RACTD offers a practical way to accelerate diffusion planners in offline RL, steering mode selection toward high-reward trajectories without concurrent multi-network training.

Abstract

Although diffusion models have achieved strong results in decision-making tasks, their slow inference speed remains a key limitation. While consistency models offer a potential solution, existing applications to decision-making either struggle with suboptimal demonstrations under behavior cloning or rely on complex concurrent training of multiple networks under the actor-critic framework. In this work, we propose a novel approach to consistency distillation for offline reinforcement learning that directly incorporates reward optimization into the distillation process. Our method achieves single-step sampling while generating higher-reward action trajectories through decoupled training and noise-free reward signals. Empirical evaluations on the Gym MuJoCo, FrankaKitchen, and long-horizon planning benchmarks demonstrate that our approach can achieve a 9.7% improvement over the previous state of the art while offering up to 142x speedup over diffusion counterparts in inference time.

Paper Structure

This paper contains 49 sections, 13 equations, 5 figures, 18 tables.

Figures (5)

  • Figure 1: Overview of Reward-Aware Consistency Trajectory Distillation (RACTD). We incorporate reward guidance into consistency trajectory distillation to train a student model that can generate high-reward actions with only one denoising step.
  • Figure 2: Visualization of CTM loss, DSM loss and reward loss.
  • Figure 3: The reward distribution of the D4RL hopper-medium-expert dataset and 100 rollouts from an unconditioned teacher, an unconditioned student, and RACTD.
  • Figure 4: Wall clock time and NFEs per action for different samplers and Diffuser on MuJoCo hopper-medium-replay.
  • Figure 5: Ablation on reward objective weight.