Behavior-Regularized Diffusion Policy Optimization for Offline Reinforcement Learning

Chen-Xiao Gao, Chenyang Wu, Mingjun Cao, Chenjun Xiao, Yang Yu, Zongzhang Zhang

TL;DR

This work targets offline RL, where exploiting unseen actions is hazardous. It extends behavior-regularized objectives to diffusion-based policies by introducing a pathwise KL penalty along the reverse-diffusion trajectory, and it develops a two-time-scale actor-critic algorithm (BDPO) that uses diffusion-step value functions to optimize the policy efficiently. A key theoretical result shows that the pathwise KL objective is equivalent to the standard KL-regularized RL objective, which makes the penalty tractable at each diffusion step. Empirically, BDPO achieves competitive or superior performance on synthetic 2D tasks and D4RL continuous-control benchmarks, with favorable runtime characteristics and robustness to the choice of regularization parameters.

Abstract

Behavior regularization, which constrains the policy to stay close to some behavior policy, is widely used in offline reinforcement learning (RL) to manage the risk of hazardous exploitation of unseen actions. Nevertheless, existing literature on behavior-regularized RL primarily focuses on explicit policy parameterizations, such as Gaussian policies. Consequently, it remains unclear how to extend this framework to more advanced policy parameterizations, such as diffusion models. In this paper, we introduce BDPO, a principled behavior-regularized RL framework tailored for diffusion-based policies, thereby combining the expressive power of diffusion policies and the robustness provided by regularization. The key ingredient of our method is to calculate the Kullback-Leibler (KL) regularization analytically as the accumulated discrepancies in reverse-time transition kernels along the diffusion trajectory. By integrating the regularization, we develop an efficient two-time-scale actor-critic RL algorithm that produces the optimal policy while respecting the behavior constraint. Comprehensive evaluations conducted on synthetic 2D tasks and continuous control tasks from the D4RL benchmark validate its effectiveness and superior performance.
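The "accumulated discrepancies in reverse-time transition kernels" can be made concrete with a minimal sketch. It assumes, as is common for diffusion models but not stated in this summary, that the actor and behavior kernels at each step are Gaussians sharing the same isotropic variance, so each per-step KL reduces to a scaled squared distance between means. The function names and inputs below are hypothetical, and the sum is taken along a single sampled trajectory rather than in expectation.

```python
def gaussian_kl_same_var(mu_p, mu_q, sigma):
    """KL( N(mu_p, sigma^2 I) || N(mu_q, sigma^2 I) ) = ||mu_p - mu_q||^2 / (2 sigma^2).

    Assumption: both kernels share the same isotropic standard deviation sigma.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(mu_p, mu_q))
    return sq_dist / (2.0 * sigma ** 2)


def pathwise_kl(actor_means, behavior_means, sigmas):
    """Accumulate per-step KL terms along one reverse-diffusion trajectory.

    actor_means[i] and behavior_means[i] are the kernel means predicted by the
    actor and the behavior model at the same noisy action; sigmas[i] is the
    shared standard deviation of that step (all hypothetical inputs).
    """
    return sum(
        gaussian_kl_same_var(m_p, m_q, s)
        for m_p, m_q, s in zip(actor_means, behavior_means, sigmas)
    )


# Two diffusion steps in a 2D action space: the models disagree only at the first step.
actor_means = [[0.0, 0.0], [1.0, 0.0]]
behavior_means = [[0.0, 1.0], [1.0, 0.0]]
sigmas = [1.0, 0.5]
total = pathwise_kl(actor_means, behavior_means, sigmas)
```

In this toy case the first step contributes 1 / (2 * 1^2) = 0.5 and the second contributes 0, so the accumulated penalty is 0.5.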

Paper Structure

This paper contains 26 sections, 8 theorems, 56 equations, 18 figures, 5 tables, 1 algorithm.

Key Result

Theorem 4.2

(Proof in Appendix thm:pathwise_kl_equivalence) Let $p^\nu$ be the behavior diffusion process. The optimal diffusion policy $p^{*}$ of the pathwise KL-regularized RL problem in Eq. (eq:pathwise_kl_obj) is also the optimal policy $\pi^*$ of the KL regularized objective in Eq. (eq:brl_obj), in the sen
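For intuition about why a pathwise KL can be penalized one diffusion step at a time, the standard chain rule for the KL divergence between two Markov chains is sketched below. The joint-path notation $p^{\pi,s}_{0:N}$ and $p^{\nu,s}_{0:N}$ is an assumption introduced here for illustration, as is the premise that both reverse processes start from the same terminal noise distribution at step $N$; the kernel notation follows the Figure 2 caption. This is a general identity, not a restatement of the theorem.

```latex
\begin{align}
D_{\mathrm{KL}}\!\left(p^{\pi,s}_{0:N} \,\middle\|\, p^{\nu,s}_{0:N}\right)
  &= \mathbb{E}_{a^{0:N} \sim p^{\pi,s}_{0:N}}
     \left[ \sum_{n=1}^{N}
       D_{\mathrm{KL}}\!\left(p^{\pi,s,a^n}_{n-1|n} \,\middle\|\, p^{\nu,s,a^n}_{n-1|n}\right)
     \right], \\
D_{\mathrm{KL}}\!\left(p^{\pi,s}_{0} \,\middle\|\, p^{\nu,s}_{0}\right)
  &\le D_{\mathrm{KL}}\!\left(p^{\pi,s}_{0:N} \,\middle\|\, p^{\nu,s}_{0:N}\right).
\end{align}
```

The first line is the chain rule: the shared terminal distribution contributes no KL term, leaving only the per-step kernel discrepancies. The second line is the data-processing inequality: marginalizing out the intermediate steps $a^{1:N}$ cannot increase the divergence, so the pathwise penalty upper-bounds the KL between the final action distributions.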

Figures (18)

  • Figure 1: Illustration of the behavior-regularized RL framework with different policy parameterizations. Unimodal policies, such as deterministic policies (left), compute the behavior as the center of mass and therefore lead to misleading regularizations; while our method (right) harnesses the flexibility of diffusion models, and the regularization is calculated as the accumulated discrepancies in diffusion directions of the actor and the behavior diffusion.
  • Figure 2: Semantic illustration of the interplay between diffusion policies and the environment. We use orange to denote the transition $p_{n-1|n}^{\pi,s_t,a^n}$ and the penalty $\ell_{n}^{\pi,s_t}$ (see Section \ref{sec:pathwise_kl}) associated with the diffusion generation process, whereas blue signifies the transition $T(\cdot|s_t,a_t^0)$ and the reward $r_t$ from the original environment MDP.
  • Figure 3: Semantic illustration of the TD backup for the Q-value function $Q^\pi$ (blue) and diffusion value function $V^{\pi,s}_n$ (orange). The update of $Q^\pi$ (Eq. (\ref{eq:upper_critic})) requires reward, penalties along the diffusion trajectory, and the Q-values at the next state. The update of $V^{\pi,s}_n$ (Eq. (\ref{eq:intermediate_value})) involves the single-step penalty and the diffusion value at the next diffusion step $n-1$.
  • Figure 4: Generation paths of BDPO on the 8gaussians (top), 2spirals (middle), and moons (bottom) datasets. The regularization strength is set to $\eta=0.06$, which is identical to Figure \ref{fig:2d_data}. The first five columns depict the diffusion generation process at different time intervals, with green dots indicating the starting points of these intervals, red dots indicating the ending points, and grey lines in between representing intermediate samples. The background color depicts the output of the diffusion value functions $V^{\pi,s}_n$ over the entire 2D space. The rightmost figures depict the final action samples. We use DDIM sampling \cite{ddim} for better illustration.
  • Figure 5: Illustration of the 8gaussians, 2spirals, and moons datasets. The top row depicts the original data distribution $p_{\text{data}}$, while the second row depicts the target distribution $p_{\text{target}}$ at $\eta=0.06$ by re-sampling data points according to their energies.
  • ...and 13 more figures
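The two TD backups described in the Figure 3 caption can be sketched as one-sample targets. This is an illustrative reading of the caption, not the paper's Eq. (eq:upper_critic) or Eq. (eq:intermediate_value): it assumes that penalties are subtracted, that they are already scaled by the regularization strength, that the environment backup is discounted by a factor `gamma`, and that the diffusion value at step 0 coincides with the Q-value of the final action. All function names are hypothetical.

```python
def q_target(reward, penalties, q_next, gamma=0.99):
    """One-sample TD target for Q(s_t, a_t).

    Combines the environment reward, the penalties accumulated along the
    diffusion trajectory that generated the next action, and the Q-value
    at the next state (assumed sign and scaling conventions).
    """
    return reward + gamma * (q_next - sum(penalties))


def diffusion_value_target(penalty_n, v_prev):
    """One-sample target for the diffusion value at step n.

    Uses only the single-step penalty at step n and the diffusion value
    at the next diffusion step n-1.
    """
    return v_prev - penalty_n


# Unrolling the diffusion backup from step 0 (assumed equal to Q) up to step N
# recovers the same "Q minus accumulated penalties" term used inside q_target.
q_at_final_action = 2.0
penalties = [0.1, 0.2]  # hypothetical penalties for steps n = 1, 2
value = q_at_final_action
for penalty in penalties:
    value = diffusion_value_target(penalty, value)

target = q_target(reward=1.0, penalties=penalties, q_next=q_at_final_action, gamma=0.5)
```

Under these assumptions the unrolled diffusion value is 2.0 - 0.3 = 1.7, and the Q target is 1.0 + 0.5 * 1.7 = 1.85, which shows how the step-level values let the penalty be handled one diffusion step at a time.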

Theorems & Definitions (14)

  • Definition 4.1
  • Theorem 4.2
  • Proposition 4.4
  • Proposition 4.5
  • Theorem 3.1
  • proof
  • Theorem 3.2: Theorem 4.2 in the main text
  • proof
  • Lemma 3.3: Soft Policy Evaluation (adapted from SAC)
  • proof
  • ...and 4 more