Diffusion Actor-Critic: Formulating Constrained Policy Iteration as Diffusion Noise Regression for Offline Reinforcement Learning

Linjiajie Fang, Ruoxue Liu, Jing Zhang, Wenjia Wang, Bing-Yi Jing

TL;DR

The paper addresses the risk of out-of-distribution (OOD) actions in offline reinforcement learning with Diffusion Actor-Critic (DAC), which formulates constrained policy iteration as diffusion noise regression. DAC represents the target policy as a diffusion model and trains it with a soft Q-guidance term, using a lower confidence bound (LCB) of a Q-ensemble as the value target, to stabilize learning and avoid OOD actions without explicit density estimation. At extraction time, the policy samples multiple diffusion-generated actions and selects the one with the highest Q-ensemble value. Because training does not propagate gradients through the denoising path, training time is reduced, and DAC outperforms prior state-of-the-art methods on nearly all D4RL environments evaluated.
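The two ensemble-based steps in this summary, a lower confidence bound over critics and best-of-N action selection, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the LCB is taken to be the ensemble mean minus `rho` times the ensemble standard deviation, selection defaults to the ensemble mean (`rho = 0`), and the two toy critics and candidate actions stand in for trained Q-networks and diffusion samples. It is not the authors' implementation.

```python
import statistics
from typing import Callable, Sequence

Action = Sequence[float]
QFunction = Callable[[Action], float]


def lcb_value(q_values: Sequence[float], rho: float = 1.0) -> float:
    """Ensemble mean minus rho times the ensemble standard deviation.

    Assumed form of the lower confidence bound; rho is a hypothetical
    pessimism coefficient.
    """
    return statistics.fmean(q_values) - rho * statistics.pstdev(q_values)


def select_action(
    candidates: Sequence[Action],
    q_ensemble: Sequence[QFunction],
    rho: float = 0.0,
) -> Action:
    """Return the candidate with the highest ensemble score.

    rho = 0 scores by the ensemble mean; rho > 0 scores by the LCB.
    """
    scores = [lcb_value([q(a) for q in q_ensemble], rho) for a in candidates]
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index]


# Toy one-dimensional example with two hypothetical critics.
q_ensemble = [
    lambda a: -(a[0] - 1.0) ** 2,
    lambda a: -(a[0] - 1.2) ** 2,
]
candidates = [[0.0], [1.0], [2.0]]  # stand-ins for diffusion-generated actions
best = select_action(candidates, q_ensemble)
```

With a larger `rho`, candidates on which the critics disagree are penalized more heavily, which is the intuition behind using a pessimistic value estimate in offline RL.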

Abstract

In offline reinforcement learning, it is necessary to manage out-of-distribution actions to prevent overestimation of value functions. One class of methods, the policy-regularized method, addresses this problem by constraining the target policy to stay close to the behavior policy. Although several approaches suggest representing the behavior policy as an expressive diffusion model to boost performance, it remains unclear how to regularize the target policy given a diffusion-modeled behavior sampler. In this paper, we propose Diffusion Actor-Critic (DAC) that formulates the Kullback-Leibler (KL) constraint policy iteration as a diffusion noise regression problem, enabling direct representation of target policies as diffusion models. Our approach follows the actor-critic learning paradigm in which we alternately train a diffusion-modeled target policy and a critic network. The actor training loss includes a soft Q-guidance term from the Q-gradient. The soft Q-guidance is based on the theoretical solution of the KL constraint policy iteration, which prevents the learned policy from taking out-of-distribution actions. We demonstrate that such diffusion-based policy constraint, along with the coupling of the lower confidence bound of the Q-ensemble as value targets, not only preserves the multi-modality of target policies, but also contributes to stable convergence and strong performance in DAC. Our approach is evaluated on D4RL benchmarks and outperforms the state-of-the-art in nearly all environments. Code is available at https://github.com/Fang-Lin93/DAC.

Paper Structure (27 sections, 3 theorems, 32 equations, 11 figures, 8 tables, 1 algorithm)

Key Result

Theorem 1

Let $\boldsymbol{\epsilon}^*(\mathbf{x}_t, \mathbf{s}, t) := - \sqrt{1-\bar{\alpha}_t} \, \nabla_{\mathbf{x}_t} \log p_t^* (\mathbf{x}_t|\mathbf{s})$. Then $\boldsymbol{\epsilon}^*(\mathbf{x}_t, \mathbf{s}, t)$ is a Gaussian noise predictor which defines a diffusion model for generating $\pi^*_{k+1}$ ...
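The relation between a score function and a noise predictor used in this definition can be checked numerically in the simplest possible case. The sketch below assumes the standard variance-preserving forward process $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$ and a single fixed clean action $x_0$, so that the noised distribution is Gaussian and its score is available in closed form. In the theorem, $p_t^*$ is the noised target policy rather than a point mass, so this is an illustration of the identity, not of the theorem itself; all function names are hypothetical.

```python
import math
import random


def forward_noise(x0: float, alpha_bar: float, eps: float) -> float:
    """Noised sample x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    return math.sqrt(alpha_bar) * x0 + math.sqrt(1.0 - alpha_bar) * eps


def gaussian_score(x_t: float, x0: float, alpha_bar: float) -> float:
    """Gradient of log N(x_t; sqrt(alpha_bar) * x0, 1 - alpha_bar) w.r.t. x_t."""
    return -(x_t - math.sqrt(alpha_bar) * x0) / (1.0 - alpha_bar)


def noise_from_score(score: float, alpha_bar: float) -> float:
    """Noise prediction implied by a score: -sqrt(1 - alpha_bar) * score."""
    return -math.sqrt(1.0 - alpha_bar) * score


rng = random.Random(0)
x0, alpha_bar = 0.7, 0.4          # arbitrary illustrative values
eps = rng.gauss(0.0, 1.0)         # the noise actually injected
x_t = forward_noise(x0, alpha_bar, eps)
recovered = noise_from_score(gaussian_score(x_t, x0, alpha_bar), alpha_bar)
```

In this point-mass case `recovered` equals the injected `eps` up to floating-point error, which is why a rescaled score can be read as a noise predictor.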

Figures (11)

  • Figure 1: A visual explanation of generating actions from noisy prior $x_T$ using (a) denoised Q-guidance (b) hard Q-guidance and (c) soft Q-guidance. The soft Q-guidance reduces the intensity of Q-guidance during the denoising steps, generating high-reward actions within the behavior support without the need to backpropagate the gradient through the path.
  • Figure 2: Comparison of generated policies on a 2-dimensional bandit using different Q-gradient guidance. We compare soft Q-guidance (magenta) against hard Q-guidance (blue), which eliminates the noise scaling factor, and denoised Q-guidance (Wang et al., 2022) (brown) on 2-D bandit examples. The dots are samples from the behavior policy, colored by reward value. The dashed level curves represent the estimated Q-value field. Soft Q-guidance is capable of generating high-reward actions while remaining within the behavior support. We also observe that soft Q-guidance captures the multi-modality of target policies, as shown in the second plot. Experimental details can be found in the appendix.
  • Figure 3: Training curves of DAC with different Q-gradient guidance. We compare soft Q-guidance (soft), hard Q-guidance (hard) and the denoised Q-guidance (denoised) on locomotion tasks. DAC with soft Q-guidance achieves stable convergence and strong performance across all the tasks.
  • Figure 4: Experiments with different Q-ensemble sizes $(H)$ on hopper tasks.
  • Figure 5: Experiments with different numbers of action samples $(N_a)$ on hopper tasks.
  • ...and 6 more figures
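
The captions for Figures 1 and 2 distinguish soft Q-guidance from hard Q-guidance by a noise scaling factor. A minimal sketch of how such a factor would vary over diffusion steps follows, assuming the factor is the $\sqrt{1-\bar{\alpha}_t}$ that appears in Theorem 1 and assuming a linear beta schedule. The schedule, its endpoints, and the step count are hypothetical choices for illustration and are not taken from the paper.

```python
import math
from typing import List


def alpha_bars(num_steps: int, beta_start: float = 1e-4, beta_end: float = 2e-2) -> List[float]:
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    bars, product = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        product *= 1.0 - beta
        bars.append(product)
    return bars


def guidance_weights(num_steps: int, soft: bool = True) -> List[float]:
    """Per-step multiplier on the Q-gradient.

    soft: sqrt(1 - alpha_bar_t), small at low-noise steps.
    hard: constant 1, i.e. the noise scaling factor is dropped.
    """
    if not soft:
        return [1.0] * num_steps
    return [math.sqrt(1.0 - ab) for ab in alpha_bars(num_steps)]


soft = guidance_weights(5)               # index 0 is the least noisy step
hard = guidance_weights(5, soft=False)
```

Under these assumptions the soft weight shrinks as denoising proceeds from the noisiest step toward the clean action, matching the caption's description of reduced guidance intensity during the denoising steps, while the hard variant applies the same weight at every step.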

Theorems & Definitions (6)

  • Theorem 1
  • Theorem 2
  • Lemma 1
  • proof
  • proof
  • proof