Entropy-regularized Diffusion Policy with Q-Ensembles for Offline Reinforcement Learning

Ruoqi Zhang; Ziwei Luo; Jens Sjölund; Thomas B. Schön; Per Mattsson

TL;DR

The paper tackles offline RL by combining entropy-regularized diffusion policies with Q-ensembles to mitigate distribution shifts and value overestimation. It leverages a mean-reverting SDE to map actions to a standard Gaussian and derives a tractable entropy term via posterior sampling, while introducing an efficient 5-step sampling scheme and a lower confidence bound from Q-ensembles for pessimistic learning. The approach achieves strong results on D4RL benchmarks, notably excelling in AntMaze tasks with sparse rewards, and demonstrates improved training stability and robustness through entropy regularization and ensemble-based uncertainty. This work advances offline RL by enabling diverse action exploration without inflating Q-value errors, offering practical benefits for real-world, data-constrained decision-making tasks.
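
To make the pessimistic-learning idea concrete, here is a minimal sketch of a lower confidence bound over a Q-ensemble. It assumes the common form "ensemble mean minus β times ensemble standard deviation"; the function name `q_ensemble_lcb` and the array layout are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

def q_ensemble_lcb(q_values: np.ndarray, beta: float) -> np.ndarray:
    """Lower confidence bound over an ensemble of Q estimates.

    q_values: array of shape (n_members, batch), one row per ensemble member
              (assumed layout, for illustration only).
    beta:     LCB coefficient; larger values are more pessimistic.
    """
    mean = q_values.mean(axis=0)   # ensemble mean per state-action pair
    std = q_values.std(axis=0)     # ensemble disagreement (population std)
    return mean - beta * std

# Two ensemble members, two state-action pairs.
q = np.array([[1.0, 2.0],
              [3.0, 2.0]])
lcb = q_ensemble_lcb(q, beta=1.0)
```

Where the members disagree (first column), the bound falls below the mean; where they agree (second column), it equals the mean, so the penalty is applied only to uncertain value estimates.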

Abstract

This paper presents advanced techniques for training diffusion policies for offline reinforcement learning (RL). At the core is a mean-reverting stochastic differential equation (SDE) that transfers a complex action distribution into a standard Gaussian and then samples actions conditioned on the environment state with a corresponding reverse-time SDE, like a typical diffusion policy. We show that such an SDE has a solution that we can use to calculate the log probability of the policy, yielding an entropy regularizer that improves the exploration of offline datasets. To mitigate the impact of inaccurate value functions from out-of-distribution data points, we further propose to learn the lower confidence bound of Q-ensembles for more robust policy improvement. By combining the entropy-regularized diffusion policy with Q-ensembles in offline RL, our method achieves state-of-the-art performance on most tasks in D4RL benchmarks. Code is available at https://github.com/ruoqizzz/Entropy-Regularized-Diffusion-Policy-with-QEnsemble.
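
As a rough illustration of how a mean-reverting SDE carries an action toward a standard Gaussian, the sketch below uses the closed-form marginal of an Ornstein-Uhlenbeck process. It assumes a zero mean-reversion target and a noise scale tied to the drift as $\sigma_t^2 = 2\theta_t$, so that the stationary distribution is $\mathcal{N}(0, I)$. The schedule values and the function names are hypothetical.

```python
import numpy as np

def forward_marginal(a0: np.ndarray, theta_bar_t: float):
    """Mean and variance of a^t given a^0 for da = -theta_t * a dt + sqrt(2 theta_t) dw.

    theta_bar_t is the integral of theta from 0 to t (assumed precomputed).
    """
    decay = np.exp(-theta_bar_t)
    mean = a0 * decay              # the action decays toward 0
    var = 1.0 - decay ** 2         # the variance grows toward 1
    return mean, var

def sample_forward(a0: np.ndarray, theta_bar_t: float, rng: np.random.Generator):
    """Draw a^t directly from the marginal, without simulating the SDE."""
    mean, var = forward_marginal(a0, theta_bar_t)
    return mean + np.sqrt(var) * rng.standard_normal(a0.shape)

# Hypothetical 5-step schedule: the same integrated theta on every step.
T = 5
theta_steps = np.full(T, 1.5)
theta_bar = np.cumsum(theta_steps)

a0 = np.array([0.8, -0.3])
mean_T, var_T = forward_marginal(a0, theta_bar[-1])
noisy_action = sample_forward(a0, theta_bar[-1], np.random.default_rng(0))
```

With this schedule the terminal mean is close to zero and the terminal variance is close to one, which is the sense in which the action distribution is mapped to a standard Gaussian.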

Paper Structure (49 sections, 1 theorem, 46 equations, 9 figures, 11 tables, 1 algorithm)

Introduction
Background
Offline RL.
Mean-Reverting SDE.
Sample Actions with SDE.
Method
Optimal Sampling with Mean-Reverting SDE
Notation Note
Diffusion Policy with Entropy Regularization
Entropy Approximation.
Pessimistic Evaluation via Q-ensembles
Experiment
Setup
Datasets
Implementation Details
...and 34 more sections

Key Result

Proposition 3.1

Given an initial variable ${\bm a}^0$, for any diffusion state ${\bm a}^t$ at time $t \in [1,T]$, the posterior of the mean-reverting SDE conditioned on ${\bm a}^0$ is Gaussian. Its mean and variance are written in terms of $\theta_i^{'} \coloneqq \int_{i-1}^i \theta_t dt$ and $\bar{\theta}_{t}$, which is shorthand for $\bar{\theta}_{0:t}$.
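
The explicit mean and variance are not reproduced in this text. As a hedged sketch, the code below writes out the expressions that follow from standard Gaussian conditioning, assuming a zero mean-reversion target and unit stationary variance (consistent with a standard Gaussian terminal distribution), and checks them against a direct product-of-Gaussians computation. The function names are hypothetical, and the closed form is a derivation under these assumptions rather than a quotation of the proposition.

```python
import numpy as np

def posterior_params(a_t, a_0, theta_bar_prev, theta_step):
    """Closed-form mean and variance of a^{t-1} given a^t and a^0.

    theta_bar_prev: integral of theta over [0, t-1]
    theta_step:     integral of theta over [t-1, t]  (theta'_t in the text)
    """
    theta_bar_t = theta_bar_prev + theta_step
    denom = 1.0 - np.exp(-2.0 * theta_bar_t)
    w_t = (1.0 - np.exp(-2.0 * theta_bar_prev)) / denom * np.exp(-theta_step)
    w_0 = (1.0 - np.exp(-2.0 * theta_step)) / denom * np.exp(-theta_bar_prev)
    mean = w_t * a_t + w_0 * a_0
    var = (1.0 - np.exp(-2.0 * theta_bar_prev)) * (1.0 - np.exp(-2.0 * theta_step)) / denom
    return mean, var

def posterior_by_conditioning(a_t, a_0, theta_bar_prev, theta_step):
    """Same posterior via Bayes' rule: p(a^{t-1}|a^0) * p(a^t|a^{t-1}), for t > 1."""
    m_prior = a_0 * np.exp(-theta_bar_prev)          # mean of a^{t-1} | a^0
    v_prior = 1.0 - np.exp(-2.0 * theta_bar_prev)    # variance of a^{t-1} | a^0
    k = np.exp(-theta_step)                          # a^t | a^{t-1} has mean k * a^{t-1}
    v_lik = 1.0 - np.exp(-2.0 * theta_step)          # variance of a^t | a^{t-1}
    var = 1.0 / (1.0 / v_prior + k ** 2 / v_lik)     # precisions add
    mean = var * (m_prior / v_prior + k * a_t / v_lik)
    return mean, var

closed = posterior_params(0.2, 0.8, theta_bar_prev=0.7, theta_step=0.3)
direct = posterior_by_conditioning(0.2, 0.8, theta_bar_prev=0.7, theta_step=0.3)
```

At the first step (`theta_bar_prev = 0`) the closed form collapses to mean ${\bm a}^0$ and zero variance, as expected when the previous state is the clean action itself.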

Figures (9)

Figure 1: A toy RL task in which the agent sequentially takes two steps (starting from 0) to seek a state with the highest reward. Left: The reward function is a mixture of Gaussians, and the offline data distribution is unbalanced, with most samples located in low-reward states. Center: Training different policies on this task with 5 random seeds for 500 epochs. We find that a diffusion policy with entropy regularization and Q-ensembles yields the best results with low training variance. Right: Learned Q-value curve for the first-step actions in state 0. The approximation of the lower confidence bound (LCB) of Q-ensembles is also plotted.
Figure 2: Comparison of the reverse-time SDE and optimal sampling process in data reconstruction.
Figure 3: Ablation experiments of our method with different values of LCB coefficient $\beta=1,2,4$ on AntMaze-Medium environments over 5 different random seeds.
Figure 4: Learning curves of Diffusion-QL and our method on selected AntMaze tasks over 5 random seeds.
Figure 5: Visualization of the workings of the mean-reverting SDE for action prediction. The SDE models the degradation process from a dataset action to noise. By guiding the policy with the corresponding reverse-time SDE and the LCB of Q, a new action is generated conditioned on the RL state.
...and 4 more figures

Theorems & Definitions (5)

Proposition 3.1
proof
proof
proof
proof
