Score Regularized Policy Optimization through Diffusion Behavior
Huayu Chen, Cheng Lu, Zhengyi Wang, Hang Su, Jun Zhu
TL;DR
SRPO addresses the sampling-speed bottleneck of diffusion-based offline RL by extracting a deterministic policy through gradient-level score regularization. A pretrained diffusion behavior model estimates the score of the behavior distribution, and the policy gradient is regularized with this score, so no diffusion sampling is needed during either training or evaluation. The method combines implicit Q-learning with diffusion-based behavior modeling and an ensemble of scores across diffusion times. It achieves large action-sampling speedups (25x–1000x) over diffusion-based methods while maintaining state-of-the-art or near-state-of-the-art performance on D4RL locomotion tasks. Ablation studies examine design choices such as ensemble times, weighting, and baselines, and the results point to SRPO's robustness and practical applicability to computation-sensitive domains like robotics.
Abstract
Recent developments in offline reinforcement learning have uncovered the immense potential of diffusion modeling, which excels at representing heterogeneous behavior policies. However, sampling from diffusion policies is slow because generating a single action requires tens to hundreds of iterative inference steps. To address this issue, we propose to extract an efficient deterministic inference policy from critic models and pretrained diffusion behavior models, leveraging the latter to directly regularize the policy gradient with the behavior distribution's score function during optimization. Our method retains the powerful generative capabilities of diffusion modeling while completely circumventing the computationally intensive and time-consuming diffusion sampling scheme, both during training and evaluation. Extensive results on D4RL tasks show that our method boosts action sampling speed by more than 25 times compared with various leading diffusion-based methods in locomotion tasks, while still maintaining state-of-the-art performance.
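
To make the idea of regularizing the policy gradient with the behavior score concrete, here is a minimal toy sketch, not the authors' implementation. It assumes a one-dimensional, state-free problem with a hypothetical quadratic critic Q(a) = -(a - a_star)^2 and a Gaussian behavior distribution whose score is known in closed form; that analytic score stands in for what a pretrained diffusion behavior model would estimate. The names `behavior_score`, `q_grad`, `extract_action`, and the temperature `beta` are illustrative assumptions.

```python
def behavior_score(a, mu_b=0.0, sigma_b=0.5):
    """Score d/da log N(a; mu_b, sigma_b^2) of a toy Gaussian behavior policy.

    Assumption: this analytic score is a stand-in for the score that a
    pretrained diffusion behavior model would provide.
    """
    return (mu_b - a) / sigma_b**2


def q_grad(a, a_star=1.0):
    """Gradient of the hypothetical critic Q(a) = -(a - a_star)^2."""
    return -2.0 * (a - a_star)


def extract_action(beta=1.0, lr=0.01, steps=2000, a0=0.0):
    """Gradient ascent on Q(a) + (1/beta) * log mu(a) for a deterministic action.

    The update direction is the critic gradient plus the behavior score
    scaled by 1/beta; no sampling from the behavior model is involved.
    """
    a = a0
    for _ in range(steps):
        a += lr * (q_grad(a) + behavior_score(a) / beta)
    return a


def closed_form(beta=1.0, a_star=1.0, mu_b=0.0, sigma_b=0.5):
    """Stationary point of the same objective, used as a sanity check."""
    w = 1.0 / (beta * sigma_b**2)
    return (2.0 * a_star + w * mu_b) / (2.0 + w)


if __name__ == "__main__":
    for beta in (0.1, 1.0, 10.0):
        print(beta, extract_action(beta=beta), closed_form(beta=beta))
```

In this toy setting a small `beta` (strong regularization) keeps the extracted action near the behavior mean, while a large `beta` lets it move toward the critic's maximizer. The step size is kept small because the score term grows as `beta` shrinks.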
