Behavior-Regularized Diffusion Policy Optimization for Offline Reinforcement Learning
Chen-Xiao Gao, Chenyang Wu, Mingjun Cao, Chenjun Xiao, Yang Yu, Zongzhang Zhang
TL;DR
This work targets offline RL, where exploiting unseen actions is hazardous. It extends behavior-regularized objectives to diffusion-based policies by introducing a pathwise KL penalty along the reverse-diffusion trajectory, and it develops a two-time-scale actor-critic algorithm (BDPO) that uses diffusion-step value functions to optimize the policy efficiently. A key theoretical result shows that the pathwise KL objective is equivalent to the standard KL-regularized RL objective, which makes the penalty tractable at each diffusion step. Empirically, BDPO achieves competitive or superior performance on synthetic 2D tasks and D4RL continuous-control benchmarks, with favorable runtime and low sensitivity to the regularization parameters.
Abstract
Behavior regularization, which constrains the policy to stay close to some behavior policy, is widely used in offline reinforcement learning (RL) to manage the risk of hazardous exploitation of unseen actions. Nevertheless, existing literature on behavior-regularized RL primarily focuses on explicit policy parameterizations, such as Gaussian policies. Consequently, it remains unclear how to extend this framework to more advanced policy parameterizations, such as diffusion models. In this paper, we introduce BDPO, a principled behavior-regularized RL framework tailored for diffusion-based policies, thereby combining the expressive power of diffusion policies and the robustness provided by regularization. The key ingredient of our method is to calculate the Kullback-Leibler (KL) regularization analytically as the accumulated discrepancies in reverse-time transition kernels along the diffusion trajectory. By integrating the regularization, we develop an efficient two-time-scale actor-critic RL algorithm that produces the optimal policy while respecting the behavior constraint. Comprehensive evaluations conducted on synthetic 2D tasks and continuous control tasks from the D4RL benchmark validate its effectiveness and superior performance.
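To make the "accumulated discrepancies in reverse-time transition kernels" concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that, at each diffusion step, the policy and the behavior model define Gaussian reverse kernels with a shared isotropic standard deviation, so that the per-step KL has the standard closed form ||mu_policy - mu_behavior||^2 / (2 sigma^2). The function names and the hard-coded means and noise scales are hypothetical, chosen only for illustration; in practice both means would come from the two networks evaluated at the same noisy action.

```python
def gaussian_kl_shared_variance(mu_p, mu_q, sigma):
    """KL(N(mu_p, sigma^2 I) || N(mu_q, sigma^2 I)) for a shared isotropic sigma.

    With equal covariances the KL reduces to the squared distance between
    the means, scaled by 1 / (2 sigma^2).
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(mu_p, mu_q))
    return sq_dist / (2.0 * sigma ** 2)


def pathwise_kl(policy_means, behavior_means, sigmas):
    """Accumulate the per-step KL terms along one reverse-diffusion trajectory.

    Each argument holds one entry per diffusion step (assumed layout):
    the policy kernel mean, the behavior kernel mean, and the shared
    standard deviation of that step.
    """
    return sum(
        gaussian_kl_shared_variance(mu_p, mu_b, sigma)
        for mu_p, mu_b, sigma in zip(policy_means, behavior_means, sigmas)
    )


# Hypothetical two-step trajectory over a 2D action space.
policy_means = [[1.0, 0.0], [0.5, 0.5]]
behavior_means = [[0.0, 0.0], [0.5, 0.0]]
sigmas = [1.0, 0.5]

total_penalty = pathwise_kl(policy_means, behavior_means, sigmas)
print(total_penalty)
```

Under these assumptions, each step contributes a penalty that depends only on how far the policy's denoising mean drifts from the behavior model's mean at that step, which is what allows the regularization to be evaluated analytically instead of estimating a KL between the final action distributions.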
