Score Regularized Policy Optimization through Diffusion Behavior

Huayu Chen, Cheng Lu, Zhengyi Wang, Hang Su, Jun Zhu

TL;DR

SRPO addresses the sampling bottleneck of diffusion-based offline RL by extracting a deterministic policy through score regularization at the gradient level. A pretrained diffusion behavior model estimates the score of the behavior distribution, and this score regularizes the policy gradient, so no diffusion sampling is needed during training or evaluation. The method combines implicit Q-learning with diffusion-based behavior modeling and an ensemble of scores over diffusion times, yielding action-sampling speedups of 25x–1000x while maintaining state-of-the-art or near-state-of-the-art performance on D4RL locomotion tasks. Ablation studies examine design choices such as the diffusion-time ensemble, the weighting, and the baselines, and support SRPO’s robustness and practical applicability to computation-sensitive domains like robotics.

Abstract

Recent developments in offline reinforcement learning have uncovered the immense potential of diffusion modeling, which excels at representing heterogeneous behavior policies. However, sampling from diffusion policies is considerably slow because it necessitates tens to hundreds of iterative inference steps for one action. To address this issue, we propose to extract an efficient deterministic inference policy from critic models and pretrained diffusion behavior models, leveraging the latter to directly regularize the policy gradient with the behavior distribution's score function during optimization. Our method enjoys powerful generative capabilities of diffusion modeling while completely circumventing the computationally intensive and time-consuming diffusion sampling scheme, both during training and evaluation. Extensive results on D4RL tasks show that our method boosts action sampling speed by more than 25 times compared with various leading diffusion-based methods in locomotion tasks, while still maintaining state-of-the-art performance.
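
To make the idea of regularizing the policy gradient with the behavior score concrete, here is a minimal sketch in a 1D bandit. It is not the authors' implementation. It assumes the objective trades off Q against a (1/β)-weighted KL term to the behavior distribution, so that a deterministic action follows the gradient of Q plus 1/β times the behavior score. The bimodal behavior distribution, its analytic score (standing in for a pretrained diffusion behavior model), the quadratic Q, and all names and hyperparameters are hypothetical.

```python
import math

# Hypothetical 1D behavior distribution: equal-weight mixture of two Gaussians.
MODES = (-1.0, 1.0)
STD = 0.2


def behavior_score(a):
    """Analytic d/da log mu(a) for the mixture.

    Assumption: this stands in for the score that a pretrained
    diffusion behavior model would estimate.
    """
    logits = [-((a - m) ** 2) / (2 * STD ** 2) for m in MODES]
    mx = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - mx) for l in logits]
    total = sum(weights)
    return sum(w / total * (m - a) / STD ** 2 for w, m in zip(weights, MODES))


def q_grad(a, a_target):
    """Gradient of the toy critic Q(a) = -(a - a_target)^2."""
    return -2.0 * (a - a_target)


def extract_action(a_target, beta, a_init=0.1, lr=0.001, steps=5000):
    """Gradient ascent on a deterministic action with score regularization."""
    a = a_init
    for _ in range(steps):
        a += lr * (q_grad(a, a_target) + behavior_score(a) / beta)
    return a


greedy = extract_action(a_target=0.4, beta=100.0)        # weak regularization
conservative = extract_action(a_target=0.4, beta=0.05)   # strong regularization
```

Under these assumed settings, a large β leaves the action near the Q-maximizer at 0.4, while a small β pulls it toward the behavior mode at 1.0. No sampling from the behavior model is needed at any point, only score evaluations.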

Paper Structure

This paper contains 30 sections, 3 theorems, 30 equations, 16 figures, 2 tables, 1 algorithm.

Key Result

Proposition 2

(Proof in Appendix \ref{sec:analysis}) Given that $\pi_\theta$ is deterministic (${\bm{a}}=\pi_\theta({\bm{s}})$) such that $\pi_{\theta, t}$ is Gaussian (${\bm{a}}_t = \alpha_t {\bm{a}} + \sigma_t\bm{\epsilon}$, $\bm{\epsilon}\sim{\mathcal{N}}(\bm{0},{\bm{I}})$), the gradient for optimizing $\max$ ...
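
The Gaussian assumption in this proposition can be illustrated with a short sketch. It shows only the perturbation ${\bm{a}}_t = \alpha_t {\bm{a}} + \sigma_t\bm{\epsilon}$ and the standard fact that the score of the resulting Gaussian at ${\bm{a}}_t$ equals $-\bm{\epsilon}/\sigma_t$; it does not reproduce the proposition's gradient. The noise schedule and all function names are hypothetical choices for illustration.

```python
import math
import random


def schedule(t):
    """Hypothetical noise schedule with alpha_t^2 + sigma_t^2 = 1 (an assumption)."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)


def diffuse(action, t, rng):
    """Perturb a deterministic action: a_t = alpha_t * a + sigma_t * eps."""
    alpha, sigma = schedule(t)
    eps = [rng.gauss(0.0, 1.0) for _ in action]
    a_t = [alpha * a + sigma * e for a, e in zip(action, eps)]
    return a_t, eps


def diffused_policy_score(a_t, action, t):
    """Score of N(alpha_t * a, sigma_t^2 I) evaluated at a_t."""
    alpha, sigma = schedule(t)
    return [-(x - alpha * a) / sigma ** 2 for x, a in zip(a_t, action)]


rng = random.Random(0)
action = [0.3, -0.7]          # stand-in for a = pi_theta(s)
a_t, eps = diffuse(action, 0.5, rng)
score = diffused_policy_score(a_t, action, 0.5)
```

Because the diffused policy is Gaussian, its score is available in closed form from the sampled noise, which is why no extra model is needed for the policy side of the comparison.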

Figures (16)

  • Figure 1: Performance and computational efficiency of different algorithms in D4RL Locomotion tasks. Computation time is assessed using a consistent hardware setup and PyTorch backend.
  • Figure 2: Comparison of different policy extraction methods under bandit settings. Forward KL policy extraction is prone to generate out-of-support actions if the policy is not sufficiently expressive (e.g., Gaussians). This can be mitigated either by employing a more expressive policy class or by switching to a reverse KL objective (our method), which demonstrates a mode-seeking nature.
  • Figure 3: Illustration of SRPO in 2D bandit settings. (a) A predefined complex data distribution, which represents the potentially heterogeneous behavior policy $\mu({\bm{a}})$. (b) A diffusion model $\hat{\mu}({\bm{a}})$ is trained to fit the behavior distribution. The data density can be analytically calculated based on sde. (c) The Q-function is manually defined as a quadratic function: $Q({\bm{a}}):=-({\bm{a}} - {\bm{a}}_{\text{tar}})^2$, where ${\bm{a}}_{\text{tar}}$ represents the 2D point with the highest estimated Q-value and is selected from a set of grid intersections. The individual Q-functions with different ${\bm{a}}_{\text{tar}}$ are depicted together, stacked, in panel (c). (d) and (e) By optimizing deterministic policies $\pi(\cdot) = {\bm{a}}_{\text{reg}}$ according to Eq. (\ref{eq:ideal_objective_gradient}) and tuning the temperature coefficient $\beta$, the resulting policies shift from greedy ones, which tend to maximize the corresponding Q-functions, to conservative ones, which are constrained close to the behavior distribution. See more experimental results in Appendix \ref{appendix:toy_more}.
  • Figure 4: Performance of other behavior regularization methods. See more results in Appendix \ref{appendix:toy_more}.
  • Figure 5: Empirical benefits of ensembling multiple diffusion times. See Remark \ref{remark1_label} in Appendix \ref{sec:analysis} for a detailed explanation.
  • ...and 11 more figures
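
The out-of-support failure described in the Figure 2 caption can be checked with a small calculation. This is a sketch under stated assumptions, not the paper's experiment: the 1D bimodal behavior distribution and the function names are hypothetical. It uses the standard result that the forward KL fit of a single Gaussian matches the mean and variance of the data distribution.

```python
import math

# Hypothetical 1D behavior distribution: equal-weight mixture of two Gaussians.
MODES = (-1.0, 1.0)
STD = 0.2


def behavior_pdf(a):
    """Density of the assumed bimodal behavior distribution."""
    norm = STD * math.sqrt(2.0 * math.pi)
    return sum(0.5 * math.exp(-((a - m) ** 2) / (2 * STD ** 2)) / norm
               for m in MODES)


def forward_kl_gaussian_fit():
    """Single-Gaussian forward KL fit, obtained by moment matching."""
    mean = sum(0.5 * m for m in MODES)
    second_moment = sum(0.5 * (STD ** 2 + m ** 2) for m in MODES)
    return mean, second_moment - mean ** 2


mean, var = forward_kl_gaussian_fit()
density_at_fit = behavior_pdf(mean)      # density where the fitted Gaussian peaks
density_at_mode = behavior_pdf(MODES[1])  # density at an actual behavior mode
```

The fitted Gaussian is centered between the two modes, where the behavior density is orders of magnitude lower than at either mode. A mode-seeking objective avoids this by committing to a single mode.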

Theorems & Definitions (7)

  • Proposition 2
  • Proposition 1
  • proof
  • Remark 1
  • Proposition 2
  • proof
  • Remark 2