Enhanced DACER Algorithm with High Diffusion Efficiency
Yinuo Wang, Likun Wang, Mining Tan, Wenjun Zou, Xujie Song, Wenxuan Wang, Tong Liu, Guojian Zhan, Tianze Zhu, Shiqi Liu, Zeyu He, Feihong Zhang, Jingliang Duan, Shengbo Eben Li
TL;DR
This work tackles the efficiency barrier of diffusion-based policies in online RL by introducing DACERv2, which guides the reverse diffusion denoising with a Q-gradient field and a time-weighted mechanism. The method combines double Q-learning with distributional DSAC and a soft-entropy constraint, optimizing a joint objective that includes an auxiliary Q-gradient loss to promote multimodal, high-quality actions with only a few diffusion steps. Empirical results on OpenAI Gym MuJoCo tasks show state-of-the-art performance in complex control environments and substantial speedups in both training and inference compared to prior diffusion-based and classical online RL methods, while also enhancing multimodality. The proposed approach improves deployment practicality for diffusion policies in real-time control and opens avenues for integrating gradient-based value guidance with efficient denoising in RL settings.
Abstract
Owing to their expressive capacity, diffusion models have shown great promise in offline reinforcement learning (RL) and imitation learning. Diffusion Actor-Critic with Entropy Regulator (DACER) extended this capability to online RL by using the reverse diffusion process as a policy approximator, achieving state-of-the-art performance. However, it still suffers from a core trade-off: more diffusion steps ensure high performance but reduce efficiency, while fewer steps degrade performance. This trade-off remains a major bottleneck for deploying diffusion policies in real-time online RL. To mitigate it, we propose DACERv2, which uses the gradient field of the Q-function with respect to the action as an auxiliary optimization target to guide the denoising process at each diffusion step, thereby introducing intermediate supervisory signals that enhance the efficiency of single-step diffusion. Additionally, we observe that the Q-gradient field is independent of the diffusion time step, which is inconsistent with the characteristics of the diffusion process. To address this issue, we introduce a temporal weighting mechanism that allows the model to eliminate large-scale noise efficiently during the early stages of denoising and refine its outputs in the later stages. Experimental results on OpenAI Gym benchmarks and multimodal tasks demonstrate that, compared with classical and diffusion-based online RL algorithms, DACERv2 achieves higher performance in most complex control environments with only five diffusion steps and exhibits stronger multimodality.
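The idea of a time-weighted Q-gradient auxiliary target can be illustrated with a minimal sketch. The abstract gives neither the loss nor the weighting schedule, so everything below is an assumption made for illustration: the toy quadratic critic, the linear weight schedule, the sign convention, and all function names are hypothetical and should not be read as the DACERv2 implementation.

```python
import numpy as np


def q_gradient(action, target):
    """Gradient w.r.t. the action of a hypothetical toy critic
    Q(a) = -||a - target||^2 (a stand-in for a learned Q-network)."""
    return -2.0 * (action - target)


def time_weight(t, num_steps, w_max=1.0, w_min=0.1):
    """Assumed linear schedule: the largest weight at the noisiest step
    (t = num_steps) and the smallest at the final refinement step (t = 1)."""
    frac = (t - 1) / max(num_steps - 1, 1)
    return w_min + (w_max - w_min) * frac


def q_gradient_loss(pred_noise, action_t, target, t, num_steps):
    """Assumed auxiliary loss: push the predicted noise toward the negative,
    time-weighted Q-gradient, so that removing the predicted noise moves
    the action uphill in Q."""
    guide = -time_weight(t, num_steps) * q_gradient(action_t, target)
    return float(np.mean(np.sum((pred_noise - guide) ** 2, axis=-1)))


if __name__ == "__main__":
    num_steps = 5  # the abstract reports results with five diffusion steps
    rng = np.random.default_rng(0)
    target = np.array([0.5, -0.5])           # maximizer of the toy critic
    action_t = rng.normal(size=(4, 2))       # a batch of noisy actions
    pred_noise = rng.normal(size=(4, 2))     # stand-in for a network output
    for t in range(num_steps, 0, -1):
        loss = q_gradient_loss(pred_noise, action_t, target, t, num_steps)
        print(f"t={t} weight={time_weight(t, num_steps):.3f} loss={loss:.3f}")
```

In this sketch the guidance signal is strong while the sample is still mostly noise and weak near the end of denoising, which is one possible reading of "eliminate large-scale noise during the early stages and refine its outputs in the later stages". In an actual training loop this term would be added to the policy objective, and the Q-gradient would come from automatic differentiation of the learned critic.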
