Diffusion Policies creating a Trust Region for Offline Reinforcement Learning

Tianyu Chen; Zhendong Wang; Mingyuan Zhou

TL;DR

Diffusion Trusted Q-Learning (DTQL) is a dual-policy approach that pairs a diffusion policy, used purely for behavior cloning, with a practical one-step policy. A newly introduced diffusion trust region loss bridges the two policies, eliminating the need for iterative denoising sampling during both training and inference.

Abstract

Offline reinforcement learning (RL) leverages pre-collected datasets to train optimal policies. Diffusion Q-Learning (DQL), which introduces diffusion models as a powerful and expressive policy class, significantly boosts the performance of offline RL. However, its reliance on iterative denoising sampling to generate actions slows down both training and inference. Several recent attempts have tried to accelerate DQL, but the gains in training and/or inference speed often come at the cost of degraded performance. In this paper, we introduce a dual-policy approach, Diffusion Trusted Q-Learning (DTQL), which comprises a diffusion policy for pure behavior cloning and a practical one-step policy. We bridge the two policies with a newly introduced diffusion trust region loss. The diffusion policy maintains expressiveness, while the trust region loss directs the one-step policy to explore freely and seek modes within the region defined by the diffusion policy. DTQL eliminates the need for iterative denoising sampling during both training and inference, making it remarkably computationally efficient. We evaluate its effectiveness and algorithmic characteristics against popular Kullback-Leibler divergence-based distillation methods in 2D bandit scenarios and gym tasks. We then show that DTQL not only outperforms other methods on the majority of the D4RL benchmark tasks but is also efficient in training and inference speed. The PyTorch implementation is available at https://github.com/TianyuCodings/Diffusion_Trusted_Q_Learning.

Paper Structure

This paper contains 42 sections, 1 theorem, 22 equations, 9 figures, 7 tables, 2 algorithms.

Introduction
Diffusion Trusted Q-Learning
Preliminaries
Diffusion Policy
The ELBO Objective
Diffusion Trust Region Loss
Diffusion Trusted Q-Learning
Policy Learning
Q-Learning
Comparison of Different Mode-Seeking Behavior Regularizations
Connection and Difference with SDS and SRPO
Experiments
Hyperparameters
D4RL Performance
Computational Efficiency
...and 27 more sections

Key Result

Theorem 1

If the policy $\mu_\phi$ satisfies the ELBO condition (the paper's equation labeled eq:elbo), then the Diffusion Trust Region Loss aims to maximize a lower bound on the distribution mode $\underset{\bm{a}_0}{\max} \log p(\bm{a}_0|\bm{s})$ for any given $\bm s$.
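
The reasoning behind this statement can be sketched as follows. This is an illustrative reading, not the paper's exact formulation: it assumes a denoiser $\mu_\phi$ that predicts the clean action, noised actions of the form $\alpha_t \bm{a}_0 + \sigma_t \bm{\epsilon}$, a positive weight $w(t)$, and a constant $C$ that does not depend on the action. The weighting, noise schedule, and constant are placeholders introduced here.

```latex
% Assumed ELBO form: the weighted denoising error bounds the log-likelihood from below.
\log p(\bm{a}_0 \mid \bm{s}) \;\ge\;
  -\,\mathbb{E}_{t,\bm{\epsilon}}\!\left[
      w(t)\,\bigl\lVert \mu_\phi(\alpha_t \bm{a}_0 + \sigma_t \bm{\epsilon},\, t \mid \bm{s}) - \bm{a}_0 \bigr\rVert_2^2
  \right] + C

% Assumed trust region loss: the same error, evaluated at actions drawn from the one-step policy.
\mathcal{L}_{\text{TR}}(\theta) \;=\;
  \mathbb{E}_{t,\bm{\epsilon},\bm{s},\,\bm{a}_\theta \sim \pi_\theta(\cdot \mid \bm{s})}\!\left[
      w(t)\,\bigl\lVert \mu_\phi(\alpha_t \bm{a}_\theta + \sigma_t \bm{\epsilon},\, t \mid \bm{s}) - \bm{a}_\theta \bigr\rVert_2^2
  \right]
```

Under these assumptions, lowering $\mathcal{L}_{\text{TR}}$ with respect to $\theta$ raises the lower bound on $\log p(\bm{a}_\theta|\bm{s})$, so the one-step policy is pushed toward actions where that bound is largest, which is the mode-seeking behavior the theorem describes.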

Figures (9)

Figure 1: Diffusion trust region loss. The first column shows how the training behavior dataset looks. Columns 2-6 display the diffusion trust region loss on different actions with varying magnitudes of Gaussian noise. We can observe that the trust regions captured by the diffusion model at a given $t$ are consistent with the high-density regions of the noisy data at that specific $t$. For example, when $t$ is small, the diffusion loss is minimal where the true action lies, and high in all other locations.
Figure 2: Green points represent the datasets we trained on. Red points are generated by $\pi_\theta$, trained using $\mathcal{L}_{\text{KL}}$. This demonstrates that the KL loss encourages the generation process to cover multiple modalities of the dataset.
Figure 3: We tested the differential impact of $\mathcal{L}_{\text{TR}}$ and $\mathcal{L}_{\text{KL}}$ on behavior regularization, using a trained Q-function for policy improvement. Red points represent actions generated from the one-step policy $\pi_\theta$.
Figure 4: Training time required by different algorithms on the D4RL antmaze-umaze-v0 task. All experiments are conducted with the same PyTorch backend and the same computing hardware setup.
Figure 5: Rewards and Gaussian policy entropy recorded during training. The blue line represents training without the NLL term, while the orange line represents training with the NLL term included.
...and 4 more figures
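
The observation in Figure 1 (the diffusion loss is small near data and large elsewhere when the noise is small) can be reproduced in a toy setting. The sketch below is not the authors' implementation: it assumes a hypothetical 1D dataset with two actions at -1 and +1, uses the closed-form posterior mean as a stand-in for a trained denoiser, and adds noise as `action + sigma * eps` without any signal scaling. All names (`DATA`, `ideal_denoiser`, `trust_region_loss`) are illustrative.

```python
import math
import random

# Hypothetical 1D behavior dataset with two modes (illustrative only).
DATA = [-1.0, 1.0]


def ideal_denoiser(a_t, sigma):
    """Posterior mean E[a_0 | a_t] for a_0 uniform over DATA and a_t = a_0 + sigma * eps.

    This closed form stands in for a trained diffusion denoiser.
    """
    logits = [-((a_t - a) ** 2) / (2.0 * sigma ** 2) for a in DATA]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    return sum(w * a for w, a in zip(weights, DATA)) / sum(weights)


def trust_region_loss(action, sigma, n_samples=2000, seed=0):
    """Monte Carlo estimate of E_eps[(denoiser(action + sigma * eps) - action)^2]."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    total = 0.0
    for _ in range(n_samples):
        noisy = action + sigma * rng.gauss(0.0, 1.0)
        total += (ideal_denoiser(noisy, sigma) - action) ** 2
    return total / n_samples


if __name__ == "__main__":
    for sigma in (0.2, 2.0):
        for action in (-1.0, 0.0, 1.0, 3.0):
            loss = trust_region_loss(action, sigma)
            print(f"sigma={sigma:.1f} action={action:+.1f} loss={loss:.4f}")
```

With small noise the estimated loss is close to zero at the two data points and clearly larger between or outside them. With large noise the low-loss region shifts toward the center, matching the caption's remark that the trust region at a given $t$ follows the high-density region of the noisy data at that $t$.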

Theorems & Definitions (7)

Theorem 1
proof
Remark 1
Remark 2
Remark 3
Remark 4
Remark 5
