Diffusion-DICE: In-Sample Diffusion Guidance for Offline Reinforcement Learning

Liyuan Mao; Haoran Xu; Xianyuan Zhan; Weinan Zhang; Amy Zhang

TL;DR

This work shows that DICE-based methods can be viewed as a transformation from the behavior distribution to the optimal policy distribution, and proposes a novel approach, Diffusion-DICE, that directly performs this transformation using diffusion models.

Abstract

One important property of DIstribution Correction Estimation (DICE) methods is that their solution is the optimal stationary distribution ratio between the optimized policy and the data-collection policy. In this work, we show that DICE-based methods can be viewed as a transformation from the behavior distribution to the optimal policy distribution. Based on this, we propose a novel approach, Diffusion-DICE, that directly performs this transformation using diffusion models. We find that the optimal policy's score function can be decomposed into two terms: the behavior policy's score function and the gradient of a guidance term that depends on the optimal distribution ratio. The first term can be obtained from a diffusion model trained on the dataset, and we propose an in-sample learning objective to learn the second term. Because the optimal policy distribution is multi-modal, the transformation in Diffusion-DICE may guide samples toward locally optimal modes. We therefore generate a few candidate actions and carefully select among them to approach the global optimum. Unlike all other diffusion-based offline RL methods, the guide-then-select paradigm in Diffusion-DICE uses only in-sample actions for training and incurs minimal exploitation of errors in the value function. We use a didactic toy example to show how previous diffusion-based methods fail to generate optimal actions because they exploit these errors, and how Diffusion-DICE successfully avoids doing so. We then conduct extensive experiments on benchmark datasets to show the strong performance of Diffusion-DICE. Project page at https://ryanxhr.github.io/Diffusion-DICE/.
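
The score decomposition described in the abstract can be sketched as a short derivation. The notation here is an assumption made for illustration, not taken from this excerpt: $\pi^{\mathcal{D}}$ is the behavior policy, $w^\ast(s,a)$ is the optimal distribution ratio, $p_{t|0}(a_t \mid a_0)$ is the forward diffusion kernel, and the optimal policy is assumed to satisfy $\pi^\ast(a \mid s) \propto \pi^{\mathcal{D}}(a \mid s)\, w^\ast(s,a)$.

```latex
\begin{align*}
\pi^\ast_t(a_t \mid s)
  &= \int p_{t|0}(a_t \mid a_0)\, \pi^\ast(a_0 \mid s)\, \mathrm{d}a_0 \\
  &\propto \int p_{t|0}(a_t \mid a_0)\, \pi^{\mathcal{D}}(a_0 \mid s)\, w^\ast(s, a_0)\, \mathrm{d}a_0 \\
  &= \pi^{\mathcal{D}}_t(a_t \mid s)\;
     \mathbb{E}_{a_0 \sim \pi^{\mathcal{D}}(a_0 \mid a_t, s)}\!\left[ w^\ast(s, a_0) \right], \\[4pt]
\nabla_{a_t} \log \pi^\ast_t(a_t \mid s)
  &= \nabla_{a_t} \log \pi^{\mathcal{D}}_t(a_t \mid s)
   + \nabla_{a_t} \log \mathbb{E}_{a_0 \sim \pi^{\mathcal{D}}(a_0 \mid a_t, s)}\!\left[ w^\ast(s, a_0) \right].
\end{align*}
```

The third line uses Bayes' rule, $\pi^{\mathcal{D}}(a_0 \mid a_t, s) = p_{t|0}(a_t \mid a_0)\, \pi^{\mathcal{D}}(a_0 \mid s) / \pi^{\mathcal{D}}_t(a_t \mid s)$, and the normalizing constant drops out of the gradient because it does not depend on $a_t$. Under these assumptions, the first term is the score of the noised behavior policy and the second is the gradient of the guidance term.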

Paper Structure (21 sections, 3 theorems, 42 equations, 8 figures, 4 tables, 1 algorithm)

Introduction
Preliminaries
Distribution Correction Estimation
Diffusion Models in Offline Reinforcement Learning
Diffusion-DICE
An Optimal Policy Transformation View of DICE
In-sample Guidance Learning for Accurate Policy Transformation
Stabilizing gradient using piecewise $f$-divergence.
Boost Performance with Optimal Action Selection
Toycase validation.
Experiments
D4RL Benchmark Datasets:
Further Experiments on Error Exploitation
Related Work
Conclusion
...and 6 more sections

Key Result

Lemma 1

Given a random variable $X$ and its corresponding distribution $P(X)$, for any non-negative function $f(x)$, the following problem is convex and its optimizer is given by $y^\ast = \log \mathbb{E}_{x \sim P(X)}[f(x)]$,
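
The objective of the lemma is not shown in this excerpt. As a hedged illustration only, one objective that is consistent with the stated optimizer is $\min_y \mathbb{E}_{x \sim P(X)}[f(x)\, e^{-y}] + y$: it is convex in $y$, and setting its derivative $-e^{-y}\,\mathbb{E}[f(x)] + 1$ to zero gives $y^\ast = \log \mathbb{E}[f(x)]$. The sketch below checks this numerically; the objective, the choice of $f$, and all names are assumptions made for the example.

```python
import numpy as np


def objective(y, fx):
    """Assumed objective (not taken from the source): E[f(x) * exp(-y)] + y."""
    return np.mean(fx) * np.exp(-y) + y


rng = np.random.default_rng(0)
x = rng.normal(size=10_000)   # samples standing in for P(X)
fx = x ** 2 + 1.0             # an arbitrary non-negative f(x)

# Closed-form optimizer stated in the lemma.
y_closed = np.log(np.mean(fx))

# Brute-force minimizer on a fine grid centered at the closed form.
ys = np.linspace(y_closed - 2.0, y_closed + 2.0, 4001)
vals = np.array([objective(y, fx) for y in ys])
y_grid = ys[np.argmin(vals)]

print(f"closed form: {y_closed:.4f}, grid search: {y_grid:.4f}")
```

The grid minimizer should agree with the closed form up to the grid spacing, which is what one would expect if the assumed objective matches the lemma.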

Figures (8)

Figure 1: Illustration of the guide-then-select paradigm
Figure 2: Toy case of a 2-D bandit problem. The actions in the offline dataset follow a bivariate standard normal distribution constrained to an annular region. The ground-truth reward has two peaks extending outward from the center. We use a diffusion model $\hat{\pi}^{\mathcal{D}}$ to fit the behavior policy and a reward model $\hat{R}$ to fit the ground-truth reward $R$. Both $\hat{\pi}^{\mathcal{D}}$ and $\hat{R}$ fit in-distribution data well but make errors in out-of-distribution regions. Diffusion-DICE generates correct optimal actions in the outer circle, while other methods tend to exploit erroneous information from $\hat{R}$ and generate only overestimated, sub-optimal actions.
Figure 3: Actions generated by the guide-then-select paradigm achieve better performance while having less overestimation error.
Figure 4: Even with the piecewise $f$-divergence, the Gaussian policy in the traditional DICE algorithm performs worse than the diffusion policy.
Figure 5: $Q(s, a)$ curves for guide-then-select paradigm and select-from-behavior paradigm.
...and 3 more figures
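
The select step of the guide-then-select paradigm (Figure 1) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `select_action` and `toy_q` are hypothetical names, the candidates are hard-coded rather than produced by a guided diffusion sampler, and the selection rule is assumed to be an argmax over an estimated Q-value.

```python
import numpy as np


def select_action(candidates, q_fn, state):
    """Pick the candidate with the highest estimated Q-value (assumed rule)."""
    q_values = np.array([q_fn(state, a) for a in candidates])
    best = int(np.argmax(q_values))
    return candidates[best], q_values


def toy_q(state, action):
    """Hypothetical stand-in for a learned Q-function."""
    return -float(np.sum((action - state) ** 2))


state = np.array([0.5, -0.5])
# In the method described above, these would come from the guided sampler.
candidates = np.array([[0.0, 0.0], [0.4, -0.6], [1.0, 1.0]])

best_action, q_values = select_action(candidates, toy_q, state)
print(best_action, q_values)
```

Because the candidates in the described method are drawn from the guided sampler rather than from arbitrary points in action space, the Q-function is only queried on a small set of actions, which is the intuition behind the reduced error exploitation reported in Figure 3.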

Theorems & Definitions (8)

Lemma 1
Theorem 1
Proposition 1
proof
proof
proof
proof
proof
