Diffusion-based Reinforcement Learning via Q-weighted Variational Policy Optimization

Shutong Ding, Ke Hu, Zhenhao Zhang, Kan Ren, Weinan Zhang, Jingyi Yu, Jingya Wang, Ye Shi

TL;DR

QVPO addresses online reinforcement learning with diffusion policies by deriving a Q-weighted variational lower bound that serves as a tight surrogate for the policy objective. It combines a Q-weight transformation (e.g., qadv) with an entropy regularization term and an efficient action-selection-based behavior policy to enhance exploration and reduce variance. Empirical results on MuJoCo locomotion tasks show that QVPO achieves state-of-the-art performance in both cumulative reward and sample efficiency, outperforming traditional online RL methods and prior diffusion-based approaches. This work advances diffusion-based online RL by providing a principled objective, practical training tricks, and strong empirical validation for continuous control.

Abstract

Diffusion models have garnered widespread attention in Reinforcement Learning (RL) for their powerful expressiveness and multimodality. It has been verified that utilizing diffusion policies can significantly improve the performance of RL algorithms in continuous control tasks by overcoming the limitations of unimodal policies, such as Gaussian policies, and providing the agent with enhanced exploration capabilities. However, existing works mainly focus on the application of diffusion policies in offline RL, while their incorporation into online RL is less investigated. The training objective of the diffusion model, known as the variational lower bound, cannot be optimized directly in online RL due to the unavailability of 'good' actions. This leads to difficulties in conducting diffusion policy improvement. To overcome this, we propose a novel model-free diffusion-based online RL algorithm, Q-weighted Variational Policy Optimization (QVPO). Specifically, we introduce the Q-weighted variational loss, which can be proved to be a tight lower bound of the policy objective in online RL under certain conditions. To fulfill these conditions, the Q-weight transformation functions are introduced for general scenarios. Additionally, to further enhance the exploration capability of the diffusion policy, we design a special entropy regularization term. We also develop an efficient behavior policy to enhance sample efficiency by reducing the variance of the diffusion policy during online interactions. Consequently, the QVPO algorithm leverages the exploration capabilities and multimodality of diffusion policies, preventing the RL agent from converging to a sub-optimal policy. To verify the effectiveness of QVPO, we conduct comprehensive experiments on MuJoCo benchmarks. The final results demonstrate that QVPO achieves state-of-the-art performance on both cumulative reward and sample efficiency.
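As a rough illustration of the idea in the abstract, the sketch below weights a standard noise-prediction (denoising) loss by a non-negative function of the Q-value, so that action samples the critic rates highly contribute more to the diffusion policy update. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the array shapes, and the advantage-style clipping used as the weight transformation are all hypothetical, since the abstract does not give the exact form of the Q-weight transformation.

```python
import numpy as np


def advantage_style_weights(q_values):
    """Turn Q-values into non-negative weights (assumed form, for illustration).

    q_values: array of shape (num_states, num_samples), one row per state.
    The per-state mean is used as a baseline and negative values are clipped
    to zero, so only above-average action samples receive weight.
    """
    baseline = q_values.mean(axis=1, keepdims=True)
    return np.maximum(q_values - baseline, 0.0)


def q_weighted_denoising_loss(weights, noise, predicted_noise):
    """Weighted mean of per-sample squared noise-prediction errors.

    weights:          (num_states, num_samples), non-negative
    noise:            (num_states, num_samples, action_dim), true noise
    predicted_noise:  same shape, output of a (hypothetical) denoiser
    """
    per_sample_error = ((noise - predicted_noise) ** 2).sum(axis=-1)
    return float((weights * per_sample_error).mean())
```

With weights fixed to 1 this reduces to an ordinary unweighted denoising loss; the weighting is what steers the policy toward actions with higher estimated value.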

Paper Structure (21 sections, 31 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: The training pipeline of QVPO. In each training epoch, QVPO first uses the diffusion policy to generate multiple action samples for every state. These action samples are then selected and assigned different weights according to the Q-values given by the value network. In addition, action samples drawn from a uniform distribution are created for the diffusion entropy regularization term. With these action samples and weights, the diffusion policy is finally optimized via the combined objective of the Q-weighted VLO loss and the diffusion entropy regularization term.
  • Figure 2: A toy example on a continuous bandit showing the effect of the diffusion entropy regularization term through the changes in the explorable area of the diffusion policy over the course of training. The contour lines indicate the reward function of the continuous bandit, which is an arbitrarily selected function with 3 peaks.
  • Figure 3: Learning curves of different algorithms on 5 MuJoCo locomotion benchmarks across 5 runs. The x-axis is the number of training epochs. The y-axis is the episodic reward. The plots are smoothed with a window of 5000.
  • Figure 4: Comparison between QVPO with and without the diffusion entropy regularization.
  • Figure 5: Comparison of QVPO with different action selection numbers for the behavior policy ($K_b$) and for the target policy ($K_t$).
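
The action selection mentioned in the Figure 1 and Figure 5 captions can be sketched as follows: draw several candidate actions from the policy for one state, score them with the value network, and keep one. This is a hedged sketch only. The captions do not state the selection rule, so picking the candidate with the highest Q-value is an assumption, and `select_action`, `sample_actions`, and `q_function` are hypothetical names standing in for the diffusion policy sampler and the critic.

```python
import numpy as np


def select_action(sample_actions, q_function, state, num_candidates, rng):
    """Pick one action out of `num_candidates` policy samples.

    sample_actions(state, k, rng) -> array of shape (k, action_dim);
        stand-in for sampling k actions from a diffusion policy.
    q_function(state, action) -> float; stand-in for the value network.
    Selection rule (assumed): return the candidate with the highest Q-value.
    """
    candidates = sample_actions(state, num_candidates, rng)
    q_values = np.array([q_function(state, a) for a in candidates])
    return candidates[int(np.argmax(q_values))]


def uniform_sampler(state, k, rng, action_dim=2):
    """Toy sampler: k actions drawn uniformly from [-1, 1]^action_dim."""
    return rng.uniform(-1.0, 1.0, size=(k, action_dim))


def toy_q(state, action):
    """Toy critic: higher value the closer the action is to (0.5, 0.5)."""
    return -float(np.sum((action - 0.5) ** 2))
```

A call such as `select_action(uniform_sampler, toy_q, None, 8, np.random.default_rng(0))` returns a single action; in this toy setting, a larger `num_candidates` (playing the role of $K_b$ or $K_t$) makes the returned action depend less on any single random draw.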
