Enhanced DACER Algorithm with High Diffusion Efficiency

Yinuo Wang, Likun Wang, Mining Tan, Wenjun Zou, Xujie Song, Wenxuan Wang, Tong Liu, Guojian Zhan, Tianze Zhu, Shiqi Liu, Zeyu He, Feihong Zhang, Jingliang Duan, Shengbo Eben Li

TL;DR

This work tackles the efficiency barrier of diffusion-based policies in online RL by introducing DACERv2, which guides the reverse diffusion denoising with a Q-gradient field and a time-weighted mechanism. The method combines double Q-learning with distributional DSAC and a soft-entropy constraint, optimizing a joint objective that includes an auxiliary Q-gradient loss to promote multimodal, high-quality actions with only a few diffusion steps. Empirical results on OpenAI Gym MuJoCo tasks show state-of-the-art performance in complex control environments and substantial speedups in both training and inference compared to prior diffusion-based and classical online RL methods, while also enhancing multimodality. The proposed approach improves deployment practicality for diffusion policies in real-time control and opens avenues for integrating gradient-based value guidance with efficient denoising in RL settings.

Abstract

Due to their expressive capacity, diffusion models have shown great promise in offline RL and imitation learning. Diffusion Actor-Critic with Entropy Regulator (DACER) extended this capability to online RL by using the reverse diffusion process as a policy approximator, achieving state-of-the-art performance. However, it still suffers from a core trade-off: more diffusion steps ensure high performance but reduce efficiency, while fewer steps degrade performance. This remains a major bottleneck for deploying diffusion policies in real-time online RL. To mitigate this, we propose DACERv2, which leverages a Q-gradient field objective with respect to action as an auxiliary optimization target to guide the denoising process at each diffusion step, thereby introducing intermediate supervisory signals that enhance the efficiency of single-step diffusion. Additionally, we observe that the independence of the Q-gradient field from the diffusion time step is inconsistent with the characteristics of the diffusion process. To address this issue, a temporal weighting mechanism is introduced, allowing the model to effectively eliminate large-scale noise during the early stages and refine its outputs in the later stages. Experimental results on OpenAI Gym benchmarks and multimodal tasks demonstrate that, compared with classical and diffusion-based online RL algorithms, DACERv2 achieves higher performance in most complex control environments with only five diffusion steps and shows greater multimodality.
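The two ideas in the abstract, a Q-gradient target for each denoising step and a weight that depends on the diffusion time step, can be illustrated with a toy sketch. Everything below is an assumption for illustration only, not the paper's actual objective: the quadratic critic `q_value`, the exponential form of `time_weight`, and the squared-error form of `q_gradient_loss` are all hypothetical. The sketch assumes the predicted noise should point opposite to the action-gradient of Q, scaled more strongly at early (noisy) reverse steps than at late ones.

```python
import math


def q_value(action, goal):
    """Toy critic (hypothetical): Q(a) = -||a - goal||^2, maximal at the goal."""
    return -sum((a - g) ** 2 for a, g in zip(action, goal))


def q_gradient(action, goal):
    """Analytic gradient of the toy critic with respect to the action."""
    return [-2.0 * (a - g) for a, g in zip(action, goal)]


def time_weight(t, num_steps):
    """Hypothetical time weight: equals 1 at t = num_steps (noisiest step)
    and shrinks as t decreases toward the final refinement steps."""
    return math.exp(t / num_steps - 1.0)


def q_gradient_loss(pred_noise, action, goal, t, num_steps):
    """Assumed auxiliary loss: squared error between the predicted noise and
    the negated, time-weighted Q-gradient at the current noisy action."""
    w = time_weight(t, num_steps)
    grad = q_gradient(action, goal)
    return sum((e + w * g) ** 2 for e, g in zip(pred_noise, grad))


if __name__ == "__main__":
    num_steps = 5                      # five diffusion steps, as in the abstract
    action, goal = [1.0, -1.0], [0.0, 0.0]
    for t in range(num_steps, 0, -1):  # reverse process: noisy -> clean
        zero_pred = [0.0, 0.0]         # an untrained denoiser stand-in
        loss = q_gradient_loss(zero_pred, action, goal, t, num_steps)
        print(f"t={t} weight={time_weight(t, num_steps):.3f} loss={loss:.3f}")
```

In this toy setting the loss is zero exactly when the predicted noise equals the negated, weighted gradient, so removing that noise moves the action toward higher Q. A real implementation would obtain the gradient by differentiating a learned critic and would add this term to the policy objective.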

Paper Structure

This paper contains 33 sections, 1 theorem, 27 equations, 13 figures, 6 tables.

Key Result

Theorem 1

Let $\mathcal{S}$ denote the state space and $\mathcal{A}$ denote the continuous action space. Suppose $p(s)$ is a distribution over states, $\mathcal{H}_0^{global}$ denotes a specific entropy value. We define the policy space $\Pi_{\mathcal{H}_0^{global}}$ as the set of policy families $\{\pi^*(\cd where $\mathcal{H}_0^{global}$ is a given constant. Within the policy space $\Pi_{\mathcal{H}_0^{gl

Figures (13)

  • Figure 1: Efficiency and Performance. The horizontal axis represents the training or inference time (increasing from right to left), while the vertical axis shows the normalized Total Average Return (TAR). The training time is the per-step computational cost on OpenAI Gym tasks, excluding the time spent on environment interaction. The inference time is measured as the latency required for the policy network to output an action given a single state as input. DACERv2 achieves outstanding performance.
  • Figure 2: Multi-goal Task. Trajectories generated by policies learned using our method (top row) and original DACER (bottom row) are shown, with the $x$-axis and $y$-axis representing 2D positions (states). The agent is initialized at the origin, and the goals are marked as red dots. The level curves indicate the reward, and reaching within 1 of the endpoint signifies task completion. Results are shown for 4, 5, and 6 goal configurations from left to right.
  • Figure 3: Training curves on benchmarks. The solid lines represent the mean, while the shaded regions indicate the 95% confidence interval over five runs. For PPO, iterations are defined by the number of network updates.
  • Figure 4: Ablation experiment curves. (a) On Walker2d-v3, DACERv2 performs far better with the Q-gradient function than without it. (b) The time-weighted mechanism further improves the performance of the algorithm. (c) Using five diffusion steps provides a balance between efficiency and performance.
  • Figure 5: Walker2d-v3
  • ...and 8 more figures

Theorems & Definitions (1)

  • Theorem 1