
Diffusion Actor-Critic with Entropy Regulator

Yinuo Wang, Likun Wang, Yuxuan Jiang, Wenjun Zou, Tong Liu, Xujie Song, Wenxuan Wang, Liming Xiao, Jiang Wu, Jingliang Duan, Shengbo Eben Li

TL;DR

This work addresses the limited expressivity of traditional policy representations in reinforcement learning by introducing a diffusion-based policy that can capture multimodal action distributions. The authors treat the reverse diffusion process as the policy and enhance exploration through an entropy regulator, which estimates the policy entropy with a Gaussian mixture model and adapts a learnable parameter $\alpha$ that scales the noise added to the action. Integrated into an online actor-critic framework, DACER optimizes actions via backpropagation through the diffusion chain while learning robust value estimates with double Q-learning and target networks. Empirical results on MuJoCo benchmarks demonstrate state-of-the-art performance in several tasks and reveal strong multimodal policy representations. An open-source JAX implementation supports reproducibility and further research.

Abstract

Reinforcement learning (RL) has proven highly effective in addressing complex decision-making and control tasks. However, in most traditional RL algorithms, the policy is typically parameterized as a diagonal Gaussian distribution with learned mean and variance, which constrains their capability to acquire complex policies. In response to this problem, we propose an online RL algorithm termed diffusion actor-critic with entropy regulator (DACER). This algorithm conceptualizes the reverse process of the diffusion model as a novel policy function and leverages the capability of the diffusion model to fit multimodal distributions, thereby enhancing the representational capacity of the policy. Since the distribution of the diffusion policy lacks an analytical expression, its entropy cannot be determined analytically. To mitigate this, we propose a method to estimate the entropy of the diffusion policy using a Gaussian mixture model. Building on the estimated entropy, we learn a parameter $\alpha$ that balances exploration and exploitation. The parameter $\alpha$ adaptively regulates the variance of the noise added to the action output by the diffusion model. Experimental trials on MuJoCo benchmarks and a multimodal task demonstrate that the DACER algorithm achieves state-of-the-art (SOTA) performance in most MuJoCo control tasks while exhibiting a stronger representational capacity of the diffusion policy.
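
The entropy regulator described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a 1-D Gaussian mixture has already been fitted to actions sampled from the policy, estimates the mixture entropy by Monte Carlo, and applies a SAC-style update to $\log \alpha$. The function names, the `noise_scale` factor, the learning rate, the target entropy, and the choice of estimator are all hypothetical and used only for illustration.

```python
import math
import random


def gmm_log_density(x, weights, means, stds):
    """Log-density of a 1-D Gaussian mixture at x (log-sum-exp for stability)."""
    terms = [
        math.log(w) - math.log(s) - 0.5 * math.log(2.0 * math.pi)
        - 0.5 * ((x - m) / s) ** 2
        for w, m, s in zip(weights, means, stds)
    ]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def estimate_gmm_entropy(weights, means, stds, n_samples=5000, seed=0):
    """Monte Carlo estimate H ~= -mean(log p(x)) with x drawn from the mixture."""
    rng = random.Random(seed)
    components = range(len(weights))
    total = 0.0
    for _ in range(n_samples):
        # Pick a component by its weight, then sample from that Gaussian.
        k = rng.choices(components, weights=weights)[0]
        x = rng.gauss(means[k], stds[k])
        total -= gmm_log_density(x, weights, means, stds)
    return total / n_samples


def update_log_alpha(log_alpha, entropy, target_entropy, lr):
    """One gradient step on the assumed loss log_alpha * (entropy - target_entropy).

    Entropy below the target raises alpha (more exploration noise);
    entropy above the target lowers it.
    """
    return log_alpha - lr * (entropy - target_entropy)


def add_exploration_noise(action, alpha, noise_scale, rng):
    """Perturb each action dimension with Gaussian noise of std noise_scale * alpha."""
    return [a + noise_scale * alpha * rng.gauss(0.0, 1.0) for a in action]


if __name__ == "__main__":
    # Hypothetical bimodal mixture standing in for a fitted action distribution.
    entropy = estimate_gmm_entropy([0.5, 0.5], [-1.0, 1.0], [0.1, 0.1])
    log_alpha = update_log_alpha(0.0, entropy, target_entropy=1.0, lr=0.03)
    noisy = add_exploration_noise(
        [0.2, -0.4], math.exp(log_alpha), 0.1, random.Random(1)
    )
    print(entropy, math.exp(log_alpha), noisy)
```

Updating $\log \alpha$ rather than $\alpha$ keeps the parameter positive without clipping, which is a common design choice in entropy-regularized methods and is assumed here rather than taken from the paper.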

Paper Structure (29 sections, 16 equations, 5 figures, 3 tables, 1 algorithm)


Figures (5)

  • Figure 1: Training curves on benchmarks. The solid lines represent the mean, while the shaded regions indicate the 95% confidence interval over five runs. For PPO and TRPO, iterations are measured by the number of network updates.
  • Figure 2: Policy representation comparison of different policies on a multimodal environment. The first row exhibits the policy distribution. The length of the red arrowheads denotes the size of the action vector, and the direction of the red arrowheads denotes the direction of actions. The second row shows the value function of each state point.
  • Figure 3: Multi-goal multimodal experiments. We selected 5 points that require multimodal policies: (0, 0), (-0.5, 0.5), (0.5, 0.5), (0.5, -0.5), (-0.5, -0.5), and sampled 100 trajectories for each point. The top row shows the results of DACER; the other row shows the results of DSAC.
  • Figure 4: Ablation experiment curves. (a) DAC denotes DACER without the entropy regulator; DACER performs far better than DAC on Walker2d-v3. (b) Adaptively tuning the noise factor from the estimated entropy outperforms both a fixed noise factor and a noise factor that decays linearly from an initial value to an end value. (c) The best performance is achieved with 20 diffusion steps; with 30 steps, training also becomes unstable.
  • Figure 5: Simulation tasks. (a) Humanoid-v3: $(s \times a) \in \mathbb{R}^{376} \times \mathbb{R}^{17}$. (b) Ant-v3: $(s \times a) \in \mathbb{R}^{111} \times \mathbb{R}^{8}$. (c) HalfCheetah-v3: $(s \times a) \in \mathbb{R}^{17} \times \mathbb{R}^{6}$. (d) Walker2d-v3: $(s \times a) \in \mathbb{R}^{17} \times \mathbb{R}^{6}$. (e) InvertedDoublePendulum-v3: $(s \times a) \in \mathbb{R}^{6} \times \mathbb{R}^{1}$. (f) Hopper-v3: $(s \times a) \in \mathbb{R}^{11} \times \mathbb{R}^{3}$. (g) Pusher-v2: $(s \times a) \in \mathbb{R}^{23} \times \mathbb{R}^{7}$. (h) Swimmer-v3: $(s \times a) \in \mathbb{R}^{8} \times \mathbb{R}^{2}$.