Consistency Models as a Rich and Efficient Policy Class for Reinforcement Learning

Zihan Ding; Chi Jin

TL;DR

Diffusion-based policies can model multi-modal action distributions, but their slow iterative sampling limits their use in real-time RL. This work introduces a consistency-model policy (consistency policy) with two instantiations, Consistency-BC and Consistency-AC, that use a conditional function $f_\theta(c, \mathbf{x}_\tau, \tau)$ to map noisy actions back toward high-probability actions along a probability-flow ODE $\frac{d\mathbf{x}_\tau}{d\tau}=-\tau \nabla_{\mathbf{x}_\tau} \log p_\tau(\mathbf{x}_\tau)$. Across offline, offline-to-online, and online RL on D4RL benchmarks, the consistency policy achieves competitive or superior performance relative to diffusion policies while substantially speeding up both training and action inference, because it needs only a few denoising steps per sampled action. The approach adds a loss-scaling scheme ($\lambda(\tau_n, \tau_{n+1};k)$) and an actor-critic objective that backpropagates through the consistency model, enabling robust offline learning and efficient online fine-tuning. Overall, the consistency policy is a practical, scalable alternative for multi-modal RL that improves compute efficiency without sacrificing much performance.
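To make the mapping $f_\theta(c, \mathbf{x}_\tau, \tau)$ concrete, the following is a minimal NumPy sketch of few-step consistency sampling for an action. It assumes the skip/output parameterization and multistep sampler from the original consistency-model formulation; the constants `EPS`, `T_MAX`, and `SIGMA_DATA`, the noise schedule, and the untrained `toy_network` are illustrative placeholders, not values or components taken from this paper.

```python
import numpy as np

# Illustrative constants (assumptions, not taken from the paper).
EPS = 0.002        # smallest noise level tau
T_MAX = 80.0       # largest noise level tau
SIGMA_DATA = 0.5   # assumed standard deviation of clean actions


def c_skip(tau):
    # Weight on the noisy input; equals 1 at tau == EPS.
    return SIGMA_DATA**2 / ((tau - EPS) ** 2 + SIGMA_DATA**2)


def c_out(tau):
    # Weight on the network output; equals 0 at tau == EPS.
    return SIGMA_DATA * (tau - EPS) / np.sqrt(SIGMA_DATA**2 + tau**2)


def toy_network(state, x_tau, tau):
    # Hypothetical stand-in for a trained network F_theta(c, x_tau, tau).
    return np.tanh(state[: x_tau.shape[0]] - x_tau / (1.0 + tau))


def consistency_fn(state, x_tau, tau):
    # f_theta(c, x_tau, tau): the skip/output weights enforce the
    # boundary condition f_theta(c, x, EPS) == x by construction.
    return c_skip(tau) * x_tau + c_out(tau) * toy_network(state, x_tau, tau)


def sample_action(state, action_dim, taus, rng):
    # taus is a short decreasing list of noise levels starting at T_MAX.
    x = rng.standard_normal(action_dim) * taus[0]   # start from pure noise
    action = consistency_fn(state, x, taus[0])      # one-step denoising
    for tau in taus[1:]:
        z = rng.standard_normal(action_dim)
        x = action + np.sqrt(tau**2 - EPS**2) * z   # re-noise to level tau
        action = consistency_fn(state, x, tau)      # denoise again
    return action


rng = np.random.default_rng(0)
state = np.zeros(4)
action = sample_action(state, action_dim=2, taus=[T_MAX, 10.0, 1.0], rng=rng)
```

Each sampled action here costs `len(taus)` network evaluations, which illustrates why a policy that needs only a few denoising steps can sample faster than one that iterates over many.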

Abstract

Score-based generative models such as diffusion models have proven effective at modeling multi-modal data in domains ranging from image generation to reinforcement learning (RL). However, inference with a diffusion model can be slow because sampling is iterative, which hinders its use in RL. We propose to apply the consistency model as an efficient yet expressive policy representation, namely the consistency policy, with an actor-critic style algorithm for three typical RL settings: offline, offline-to-online, and online. For offline RL, we demonstrate the expressiveness of generative models as policies learned from multi-modal data. For offline-to-online RL, the consistency policy is shown to be more computationally efficient than the diffusion policy, with comparable performance. For online RL, the consistency policy demonstrates a significant speedup and even higher average performance than the diffusion policy.

Paper Structure (34 sections, 5 equations, 12 figures, 10 tables, 5 algorithms)

Introduction
Related Works
Offline and Offline-to-Online RL.
Score-based Generative Model for RL.
Preliminaries
Offline and Online RL
Consistency Model
Consistency Model as RL Policy
Consistency Action Inference.
Consistency Behavior Cloning.
Consistency Actor-Critic.
Loss Scaling.
Experimental Evaluation
Offline RL: Behavior Cloning with Expressive Policy Representation
Offline RL: Consistency Actor-Critic
...and 19 more sections

Figures (12)

Figure 1: Average training time (seconds per epoch) for Consistency-BC and Diffusion-BC across tasks.
Figure 2: The average normalized scores and training time versus $N$ for two models on hopper-medium-expert.
Figure 3: Comparison of variants of Consistency-AC across tasks in offline RL setting.
Figure 4: Learning curves of Diffusion-QL and Consistency-AC for online RL and offline-to-online RL with offline model selection in time axis (all trained with one million environment steps). Each curve is smoothed and averaged over five random seeds, and shaded regions show the $95\%$ confidence interval.
Figure 5: Visualization of t-SNE plots for 10000 (3000 for pen-human-v1 and kitchen-complete-v0) randomly selected $(s,a)$ samples in D4RL dataset, colored by normalized reward (range $[-1, 1]$).
...and 7 more figures
